Description
Problem
The bot detection heuristic in github_client.py uses a suffix-based regex to catch bot accounts:
    _BOT_SUFFIX_RE = re.compile(r"(\[bot\]|-bot|_bot|-app)$", re.IGNORECASE)

This requires a delimiter before "bot" (-bot, _bot, [bot]). It misses {project}bot patterns where "bot" is concatenated directly onto the project name without a delimiter, e.g.:
- pytorchbot (500 merged PRs across 3 repos, scored MEDIUM instead of BOT)
- Other potential examples: chromiumbot, tensorflowbot, etc.
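The gap can be reproduced in isolation. The snippet below is a minimal sketch that copies the pattern quoted above; the helper name `matches_current_suffix` and the sample logins other than pytorchbot are illustrative assumptions, not code from the repository.

```python
import re

# Pattern copied verbatim from the issue text.
_BOT_SUFFIX_RE = re.compile(r"(\[bot\]|-bot|_bot|-app)$", re.IGNORECASE)


def matches_current_suffix(login: str) -> bool:
    """Hypothetical helper: True if the login matches the existing suffix regex."""
    return _BOT_SUFFIX_RE.search(login) is not None


# Sample logins chosen for illustration only.
for login in ["dependabot[bot]", "my-bot", "ci_bot", "pytorchbot"]:
    print(f"{login}: {matches_current_suffix(login)}")
```

The first three logins end in a delimited suffix and match; pytorchbot has no delimiter before "bot" and does not.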
Impact
pytorchbot has 500 merged PRs across only 3 repos and scores 0.6476 (MEDIUM) in v1. It should be classified as BOT and short-circuited to score 0.0.
Suggested Fix
Add a broader suffix check that catches logins ending in "bot" without requiring a delimiter, while avoiding false positives on legitimate usernames that happen to end in "bot" (e.g., talbot). Options:
- Simple suffix match with minimum project-name length: catch logins where the last 3 chars are "bot" and the prefix is >= 3 chars (avoids matching short names like "robot")
- Known-project-bot list: explicitly list known project bots (pytorchbot, etc.) in _BOT_PREFIX_RE
- GitHub user type check: the GraphQL __typename field already catches GitHub App bots, but pytorchbot is a regular User account acting as a bot
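To make the first option concrete, here is a rough sketch of a suffix match with a minimum prefix length, layered on top of the existing regex. The function names `is_concatenated_bot` and `is_bot_login` and the default threshold of 3 are assumptions for illustration; the real change in github_client.py may look different.

```python
import re

# Existing delimiter-based pattern, copied from the issue text.
_BOT_SUFFIX_RE = re.compile(r"(\[bot\]|-bot|_bot|-app)$", re.IGNORECASE)


def is_concatenated_bot(login: str, min_prefix_len: int = 3) -> bool:
    """Hypothetical check: login ends in "bot" with a long enough prefix.

    min_prefix_len is an assumed threshold taken from the option text.
    """
    lowered = login.lower()
    if not lowered.endswith("bot"):
        return False
    prefix_len = len(lowered) - len("bot")
    return prefix_len >= min_prefix_len


def is_bot_login(login: str) -> bool:
    """Hypothetical combined check: existing regex OR concatenated suffix."""
    return bool(_BOT_SUFFIX_RE.search(login)) or is_concatenated_bot(login)


for login in ["pytorchbot", "chromiumbot", "robot", "abbott", "talbot"]:
    print(f"{login}: {is_bot_login(login)}")
```

With these assumptions, pytorchbot and chromiumbot are flagged and robot is not (its prefix is only 2 characters). abbott is unaffected either way because it ends in "ott", while a surname-like login such as talbot has a 3-character prefix and would still be flagged, so a name-only check is not free of false positives.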
Option 1 seems like the best balance of coverage and precision. It could also be combined with a check for accounts that have very few unique repos relative to PR count (pytorchbot: 500 PRs, 3 repos).
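One possible way to combine the name signal with repo diversity is sketched below. Everything here is an assumption: the function name `is_probable_project_bot`, the `min_prs` and `min_prs_per_repo` thresholds, and the idea of requiring both signals at once. The ratio alone would also flag a prolific human maintainer working in a single repo, so the sketch only applies it to logins that already end in "bot".

```python
def is_probable_project_bot(
    login: str,
    merged_prs: int,
    unique_repos: int,
    min_prs: int = 100,              # assumed threshold, not from the codebase
    min_prs_per_repo: float = 50.0,  # assumed threshold, not from the codebase
) -> bool:
    """Hypothetical check: "bot" suffix AND many PRs concentrated in few repos."""
    if not login.lower().endswith("bot"):
        return False
    if unique_repos <= 0 or merged_prs < min_prs:
        return False
    return merged_prs / unique_repos >= min_prs_per_repo


# Figures for pytorchbot are the ones quoted in this issue; the others are made up.
print(is_probable_project_bot("pytorchbot", merged_prs=500, unique_repos=3))
print(is_probable_project_bot("talbot", merged_prs=40, unique_repos=12))
print(is_probable_project_bot("alice", merged_prs=500, unique_repos=3))
```

Under these thresholds pytorchbot (about 167 PRs per repo) is flagged, while a low-volume login ending in "bot" and a high-volume login without the suffix are not.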
References
- Bot detection: src/good_egg/github_client.py:175-188
- Scorer bot short-circuit: src/good_egg/scorer.py:44-54