AST-driven near-duplicate code detector for Python. Finds copy-paste clones even when variables are renamed. Built for AI-assisted refactoring workflows.
$ duphunter-near-dup src/ --output text --threshold 0.76
src/api/users.py:45 get_user_by_id
src/api/orders.py:89 get_order_by_id
similarity: 0.91 | 12 lines each | 3 AST diffs
Linters catch style issues. Type checkers catch type errors. DupHunter catches structural duplication — the kind that leads to bugs when you fix something in one copy but forget the other.
- AST normalization: Renames local variables to positional placeholders, so `get_user_by_id` and `get_order_by_id` match even with different parameter names
- MinHash + LSH indexing: Scales to large codebases without O(N^2) pairwise comparison
- Literal normalization: Finds "same logic, different data" clones (strings, numbers normalized)
- Cluster output: Groups transitive duplicates, so if A matches B and B matches C, it reports {A, B, C}
- MCP server: AI assistants can search for duplicates directly
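To make the normalization idea concrete, here is a minimal sketch using only the standard `ast` module. It is not DupHunter's actual implementation: the `Normalizer` class, the `normalize` helper, the sample functions, and the exact placeholder scheme are assumptions made for illustration.

```python
import ast


class Normalizer(ast.NodeTransformer):
    """Hypothetical normalizer: positional placeholders plus literal type tokens."""

    def __init__(self):
        self.names = {}  # original identifier -> placeholder

    def _placeholder(self, name, prefix):
        if name not in self.names:
            self.names[name] = f"{prefix}{len(self.names)}"
        return self.names[name]

    def visit_FunctionDef(self, node):
        node.name = "func"        # the function's own name is ignored
        self.generic_visit(node)  # visits the arguments first, then the body
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg, "arg")
        node.annotation = None
        return node

    def visit_Name(self, node):
        # Rename only names bound inside the function; globals such as `db` stay.
        if isinstance(node.ctx, ast.Store) or node.id in self.names:
            node.id = self._placeholder(node.id, "local")
        return node

    def visit_Constant(self, node):
        node.value = type(node.value).__name__  # "users" -> "str", None -> "NoneType"
        return node


def normalize(source: str) -> str:
    """Return a structural dump with local names and literal values erased."""
    return ast.dump(Normalizer().visit(ast.parse(source)))


USER_SRC = '''
def get_user_by_id(user_id):
    row = db.query("users", user_id)
    if row is None:
        return None
    return dict(row)
'''

ORDER_SRC = '''
def get_order_by_id(order_id):
    rec = db.query("orders", order_id)
    if rec is None:
        return None
    return dict(rec)
'''

print(normalize(USER_SRC) == normalize(ORDER_SRC))  # True
```

The two functions differ in every local name and string literal, yet they produce the same dump, which is the property that lets renamed copy-paste clones line up.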
# Global install (recommended)
pipx install duphunter
# Or with pip
pip install duphunter
# Scan entire project
duphunter-near-dup .
# Focused search: "what's similar to the function at line 120?"
duphunter-near-dup . --query src/app.py:120 --top 20
# Human-readable output
duphunter-near-dup . --output text
# Cluster transitive duplicates
duphunter-near-dup . --cluster --output text
# Adjust sensitivity
duphunter-near-dup . --threshold 0.72 # more results, looser matching
duphunter-near-dup . --sensitivity high # catch small 4-5 line helpers
# Ignore known patterns
duphunter-near-dup . --exclude-functions "test_*,setup_*"
duphunter-near-dup . --exclude-functions "re:^get_.*_by_.*$"
Add to ~/.claude.json or .mcp.json:
{
  "mcpServers": {
    "duphunter": {
      "command": "duphunter-mcp"
    }
  }
}

MCP tool: `search_near_duplicates` accepts the same options as the CLI flags (as JSON fields) and returns a structured match payload. Default sensitivity is high in MCP mode.
- Parse: Python AST extracts all function and class bodies
- Normalize: Local identifiers → positional placeholders (`arg0`, `local1`), literals → type tokens
- Fingerprint: Normalized AST tokens → MinHash signatures (128 permutations)
- Index: LSH bands for fast candidate retrieval
- Compare: Candidate pairs scored by Jaccard similarity on AST token shingles
- Report: Pairs above threshold, optionally clustered by transitivity
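The Fingerprint, Compare, and Report steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's code: the shingle size `k=3`, the salted `blake2b` hash family standing in for permutations, and all function names are hypothetical, and the LSH banding step is left out for brevity.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128  # the number of permutations listed in the steps above


def shingles(tokens, k=3):
    """Set of overlapping k-token windows (k=3 is an assumed value)."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def minhash(shingle_set, num_perm=NUM_PERM):
    """Keep the minimum hash per salted hash function; each salt acts as one permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingle_set
        ))
    return signature


def estimate(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)


def cluster(pairs):
    """Group transitive matches with a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical normalized token streams for two near-identical functions.
tokens_a = "def func arg0 local1 = call arg0 str if local1 is NoneType return NoneType return call local1".split()
tokens_b = "def func arg0 local1 = call arg0 str if local1 is NoneType return NoneType return local1".split()

a, b = shingles(tokens_a), shingles(tokens_b)
print("exact:", jaccard(a, b), "estimated:", estimate(minhash(a), minhash(b)))
print(cluster([("A", "B"), ("B", "C"), ("D", "E")]))  # [['A', 'B', 'C'], ['D', 'E']]
```

Signatures are fixed-size, so the estimate costs the same regardless of function length. A real index would additionally split each signature into bands to retrieve candidates without comparing every pair.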
Filter out known wrappers, test helpers, or generated code:
| Format | Example | Matches |
|---|---|---|
| Exact name | `approve_suggestion` | Only that function |
| Glob pattern | `get_*_by_*` | `get_user_by_id`, `get_order_by_name`, ... |
| Regex | `re:^test_.*_integration$` | `test_api_integration`, `test_db_integration`, ... |
Multiple rules: comma-separated or repeated flags.
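One way to picture how the three rule formats could be evaluated is shown below. It is a sketch, not DupHunter's actual matcher: `parse_rules` and `is_excluded` are hypothetical names, and treating a regex rule as a search rather than a full match is an assumption.

```python
import fnmatch
import re


def parse_rules(spec: str) -> list[str]:
    """Split a comma-separated rule string (assumes no commas inside a regex)."""
    return [rule.strip() for rule in spec.split(",") if rule.strip()]


def is_excluded(name: str, rules: list[str]) -> bool:
    """True if the function name matches any exact, glob, or `re:` rule."""
    for rule in rules:
        if rule.startswith("re:"):
            if re.search(rule[3:], name):
                return True
        elif fnmatch.fnmatchcase(name, rule):
            # A rule without wildcards only matches the identical name.
            return True
    return False


rules = parse_rules("approve_suggestion,get_*_by_*,re:^test_.*_integration$")
print(is_excluded("get_order_by_name", rules))  # True
print(is_excluded("create_order", rules))       # False
```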
| Use case | Command |
|---|---|
| General scan | duphunter-near-dup . --threshold 0.76 |
| Refactoring prep | duphunter-near-dup . --query FILE:LINE --top 20 |
| Catch small helpers | duphunter-near-dup . --sensitivity high |
| CI integration | duphunter-near-dup . --threshold 0.85 --output json |
- Python 3.10+
- No external dependencies (stdlib only; uses `ast`, `hashlib`, `collections`)
MIT