Bot detection misses {project}bot login patterns (e.g. pytorchbot) #33

@jeffreyksmithjr

Problem

The bot detection heuristic in github_client.py uses suffix-based regex to catch bot accounts:

_BOT_SUFFIX_RE = re.compile(r"(\[bot\]|-bot|_bot|-app)$", re.IGNORECASE)

Every "bot" alternative requires a delimiter or bracket (-bot, _bot, [bot]). The pattern therefore misses {project}bot logins, where "bot" is concatenated directly onto the project name with no delimiter, e.g.:

  • pytorchbot (500 merged PRs across 3 repos, scored MEDIUM instead of BOT)
  • Other potential examples: chromiumbot, tensorflowbot, etc.
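
The miss can be reproduced in isolation. This is a minimal sketch that copies the pattern quoted above; the helper name `matches_current_pattern` and the use of `search` are assumptions for illustration, not necessarily how github_client.py applies the regex.

```python
import re

# Pattern copied verbatim from the issue text.
_BOT_SUFFIX_RE = re.compile(r"(\[bot\]|-bot|_bot|-app)$", re.IGNORECASE)


def matches_current_pattern(login: str) -> bool:
    """Hypothetical helper: True if the login ends with a delimited bot suffix."""
    return _BOT_SUFFIX_RE.search(login) is not None


if __name__ == "__main__":
    for login in ["dependabot[bot]", "renovate-bot", "my_bot", "pytorchbot"]:
        print(f"{login}: {matches_current_pattern(login)}")
```

Only the first three logins match; pytorchbot falls through because nothing separates "bot" from the project name.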

Impact

pytorchbot has 500 merged PRs across only 3 repos, yet it scores 0.6476 (MEDIUM) in v1. It should instead be classified as BOT and short-circuited to a score of 0.0.

Suggested Fix

Add a broader suffix check that catches logins ending in "bot" without requiring a delimiter, while avoiding false positives on legitimate usernames that happen to end in "bot" (e.g., abbot). Options:

  1. Simple suffix match with minimum project-name length: catch logins where the last 3 chars are "bot" and the prefix is >= 3 chars (avoids matching short names like "robot")
  2. Known-project-bot list: explicitly list known project bots (pytorchbot, etc.) in _BOT_PREFIX_RE
  3. GitHub user type check: not sufficient on its own. The GraphQL __typename field already catches GitHub App bots, but pytorchbot is a regular User account acting as a bot
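
Option 1 could look roughly like the sketch below. The function name `has_bare_bot_suffix` and the `min_prefix_len` parameter are hypothetical, chosen only to illustrate the length rule described in the list above.

```python
def has_bare_bot_suffix(login: str, min_prefix_len: int = 3) -> bool:
    """Hypothetical check: login ends in "bot" and the part before it is long enough."""
    lowered = login.lower()
    if not lowered.endswith("bot"):
        return False
    # Length of everything before the trailing "bot".
    prefix_len = len(lowered) - len("bot")
    return prefix_len >= min_prefix_len


if __name__ == "__main__":
    for login in ["pytorchbot", "robot", "abbot", "talbot"]:
        print(f"{login}: {has_bare_bot_suffix(login)}")
```

With a minimum prefix of 3, pytorchbot matches while robot and abbot do not. A surname-style login such as talbot still matches, so the length rule alone does not rule out every false positive.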

Option 1 seems like the best balance of coverage and precision. It could also be combined with a check for accounts that have very few unique repos relative to their PR count (pytorchbot: 500 PRs, 3 repos).
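
One way the combined check might be sketched is shown here. The function name `is_probable_project_bot`, its parameters, and the threshold of 50 merged PRs per repo are all assumptions for illustration; a real threshold would need tuning against actual account data.

```python
def is_probable_project_bot(
    login: str,
    merged_prs: int,
    unique_repos: int,
    min_prefix_len: int = 3,
    min_prs_per_repo: float = 50.0,  # hypothetical threshold, not from the codebase
) -> bool:
    """Hypothetical check: bare "bot" suffix AND PR activity concentrated in few repos."""
    lowered = login.lower()
    if not lowered.endswith("bot"):
        return False
    if len(lowered) - len("bot") < min_prefix_len:
        return False
    if unique_repos <= 0:
        # No repo data: do not classify on the name alone.
        return False
    return merged_prs / unique_repos >= min_prs_per_repo


if __name__ == "__main__":
    # Figures for pytorchbot are taken from the issue: 500 merged PRs, 3 repos.
    print(is_probable_project_bot("pytorchbot", merged_prs=500, unique_repos=3))
```

Requiring both signals means a human contributor whose login happens to end in "bot" is not flagged unless their merged PRs are also concentrated in very few repos.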

References

  • Bot detection: src/good_egg/github_client.py:175-188
  • Scorer bot short-circuit: src/good_egg/scorer.py:44-54
