A language- and platform-spanning benchmark suite for evaluating autonomous debugging systems.
⚠️ This is the evaluation version. Code contains no hints or solution markers. For learning/reference with annotated bugs, see real-world-debugging-examples.
| Document | Purpose |
|---|---|
| CAPABILITIES.md | What "agentic debugging" means and what this suite proves |
| EVALUATION.md | Scoring rubric for judging success |
| RUN_MODES.md | Environment requirements per project (headless vs IDE) |
This repository contains real project files with deliberately introduced errors, designed to test and calibrate agentic debugging capabilities. Each project:
- Contains realistic, non-trivial code (not toy examples)
- Has errors that mirror real-world failure patterns
- Produces a visible, satisfying result when fixed
- Documents expected behavior without revealing solutions
| Pillar | What It Proves | Where Tools Fail |
|---|---|---|
| Static + Structural | Parsing, AST analysis, syntax repair | Missing file-to-file awareness |
| Runtime Failures | Execution awareness, environment reasoning | Stop at "suggestion" without re-execution |
| Test Failures | Intent reasoning, not just syntax repair | Can't align fixes to test intent |
| Multi-File / Cross-Layer | Agentic reasoning across boundaries | No coordinated multi-file fixes |
| Configuration & Infra | System-level understanding | Hallucinate config solutions |
| Hypothesis-Driven | Proactive reasoning, not reactive | No evidence gathering or confidence scoring |
| Category | Language | Why |
|---|---|---|
| Dynamic | Python | AI/ML ecosystem, debugging sweet spot |
| Web | JavaScript / TypeScript | Frontend + backend |
| Systems | Go | Type & compile rigor |
| Enterprise | Java | Real-world expectations |
| Mobile | Kotlin (Android), Swift (iOS) | Build system + platform constraints |
| Game | Unity (C#), C++ | Engine-aware reasoning, asset coordination |
autonomous-software-debugging-benchmarks/
├── python/
│ ├── static_structural/ # Import/syntax errors
│ ├── runtime_failure/ # Environment, null refs, type coercion
│ ├── test_failure/ # Failing tests revealing logic flaws
│ ├── multi_file_bug/ # Cross-module contract violations
│ └── hypothesis_debugging/ # Ambiguous symptoms, multiple causes
├── javascript/
│ ├── static_structural/ # Module errors, broken exports
│ ├── runtime_failure/ # Async bugs, undefined access
│ ├── test_failure/ # Jest tests revealing edge cases
│ ├── frontend_backend_mismatch/ # API contract drift
│ └── config_failure/ # Env, port, config issues
├── typescript/
│ ├── type_errors/ # Generic constraints, inference failures
│ └── async_failures/ # Promise chains, race conditions
├── java/
│ ├── dependency_issue/ # Maven resolution
│ ├── logic_error/ # Off-by-one, state bugs
│ └── test_failure/ # JUnit revealing intent mismatch
├── go/
│ ├── runtime_panic/ # Nil pointer, slice bounds
│ └── concurrency_bug/ # Race conditions, deadlocks
├── kotlin/
│ └── android_lifecycle/ # Activity/Fragment lifecycle issues
├── swift/
│ ├── optionals_crash/ # Force unwrap failures
│ ├── build_error/ # Missing Info.plist keys
│ └── ui_thread/ # Main thread violations
├── unity/
│ ├── lifecycle_bug/ # MonoBehaviour order issues
│ ├── serialization_error/ # Missing SerializeField
│ └── scene_mismatch/ # Asset-code desync
├── cpp/
│ ├── linker_error/ # Undefined references
│ ├── memory_issue/ # Safe memory bugs
│ └── header_missing/ # Include path problems
└── cross_domain/
└── unity_node_contract/ # Multi-language integration
- Point your debugging system at any project folder
- Observe: Does it identify the root cause?
- Observe: Does it execute and verify the fix?
- Observe: Does it produce the expected result?
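To pick a project folder programmatically, the layout in the directory tree can be walked with a short script. This is a minimal sketch, assuming every project sits exactly two levels deep (language folder, then project folder); `find_projects` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def find_projects(root):
    """List 'language/project' folders, assuming a two-level layout."""
    projects = []
    for lang_dir in sorted(Path(root).iterdir()):
        # Skip plain files (README.md, etc.) and hidden folders such as .git
        if not lang_dir.is_dir() or lang_dir.name.startswith("."):
            continue
        for proj_dir in sorted(lang_dir.iterdir()):
            if proj_dir.is_dir():
                projects.append(f"{lang_dir.name}/{proj_dir.name}")
    return projects
```

Calling `find_projects(".")` from the repository root would return entries such as `python/test_failure`. Any non-project folder that has subfolders of its own would also be listed, so filter the result if needed.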
A debugging system demonstrates capability when it:
- Localizes the error to specific file(s) and line(s)
- Explains why the error occurs (not just what)
- Fixes with minimal, targeted changes
- Verifies by running the code/tests
- Produces the documented expected output
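The five criteria above can be recorded as a simple checklist per run. The sketch below is illustrative only: `DebugRunResult` and its field names are hypothetical, it treats all five criteria as required, and it is not the scoring rubric from EVALUATION.md.

```python
from dataclasses import dataclass, fields


@dataclass
class DebugRunResult:
    """One debugging run, judged against the five criteria listed above."""
    localized: bool        # error pinned to specific file(s) and line(s)
    explained: bool        # cause explained, not just the symptom
    minimal_fix: bool      # fix is small and targeted
    verified: bool         # code/tests were actually run after the fix
    expected_output: bool  # documented expected output was produced


def unmet_criteria(result):
    """Return the names of criteria the run did not satisfy."""
    return [f.name for f in fields(result) if not getattr(result, f.name)]


def demonstrates_capability(result):
    """Assumption: a run counts only if every criterion is met."""
    return not unmet_criteria(result)
```

A run that fixes the bug but never re-executes the code would report `["verified"]` from `unmet_criteria`, which matches the failure mode named in the Runtime Failures pillar.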
Each project's README describes:
- What's broken (symptoms only)
- Expected behavior when fixed
- How to verify success
No solutions are provided. The debugger must reason independently.
| Rating | Meaning | Example Projects |
|---|---|---|
| ⭐ | Single file, obvious error | — |
| ⭐⭐ | Normal bugs developers hit daily | python/test_failure, java/test_failure, javascript/test_failure, typescript/type_errors |
| ⭐⭐⭐ | Cross-layer reasoning required | python/runtime_failure, javascript/runtime_failure |
| ⭐⭐⭐⭐ | Hypothesis generation needed | python/multi_file_bug, go/concurrency_bug, typescript/async_failures |
| ⭐⭐⭐⭐⭐ | Platform + toolchain + code | cross_domain/unity_node_contract, cpp/header_missing |
⭐ ░░░░░░░░░░░░░░░░░░░░ 0% (baseline, not included)
⭐⭐ ████████░░░░░░░░░░░░ 35% (credibility anchors)
⭐⭐⭐ ██████░░░░░░░░░░░░░░ 25% (intermediate)
⭐⭐⭐⭐ ██████░░░░░░░░░░░░░░ 25% (advanced)
⭐⭐⭐⭐⭐ ████░░░░░░░░░░░░░░░░ 15% (expert)
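The ratings table and distribution above can be captured as plain data, for example to run the easier tiers first. This is a sketch: the project paths and percentages are copied from the tables above, while `examples_up_to` and the constant names are hypothetical.

```python
# Example projects per star rating, as listed in the ratings table
EXAMPLE_PROJECTS = {
    2: ["python/test_failure", "java/test_failure",
        "javascript/test_failure", "typescript/type_errors"],
    3: ["python/runtime_failure", "javascript/runtime_failure"],
    4: ["python/multi_file_bug", "go/concurrency_bug",
        "typescript/async_failures"],
    5: ["cross_domain/unity_node_contract", "cpp/header_missing"],
}

# Share of the suite at each rating, in percent (one-star is not included)
TIER_SHARE_PERCENT = {1: 0, 2: 35, 3: 25, 4: 25, 5: 15}


def examples_up_to(max_stars):
    """Return example projects rated at or below max_stars, easiest first."""
    selected = []
    for stars in sorted(EXAMPLE_PROJECTS):
        if stars <= max_stars:
            selected.extend(EXAMPLE_PROJECTS[stars])
    return selected
```

For instance, `examples_up_to(3)` returns the four two-star examples followed by the two three-star ones.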
To add a new test case:
- Create a realistic, minimal project that does something useful
- Introduce a single, realistic failure pattern
- Document symptoms and expected success state
- Create INSTRUCTOR_NOTES.md with solution details (excluded from eval)
- Tag with difficulty rating and capability pillar
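Because INSTRUCTOR_NOTES.md holds solution details, it must not reach the system under test. The sketch below shows one way an answer-free copy could be produced in Python. It is not the repository's own tooling; `make_eval_copy` is a hypothetical helper, and it assumes the output directory does not yet exist and lies outside the source tree.

```python
import shutil
from pathlib import Path


def make_eval_copy(source_dir, output_dir, answer_files=("INSTRUCTOR_NOTES.md",)):
    """Copy source_dir to output_dir, leaving out answer files at any depth.

    Returns the relative paths of the files that were left out.
    """
    source = Path(source_dir)
    # Record which answer files exist so the caller can audit the result
    skipped = sorted(
        p.relative_to(source).as_posix()
        for name in answer_files
        for p in source.rglob(name)
    )
    # ignore_patterns filters matching names in every subdirectory
    shutil.copytree(source, output_dir, ignore=shutil.ignore_patterns(*answer_files))
    return skipped
```

Returning the skipped paths makes it easy to confirm that every project contributed its notes file to the exclusion list before the copy is handed to a debugger.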
For fair benchmarking, use the sanitization script to create an answer-free copy:
.\scripts\sanitize.ps1 -SourceDir . -OutputDir ./eval
MIT - Use freely for benchmarking, teaching, or tool evaluation.
This suite is designed for general use in evaluating any autonomous debugging system.