A language- and platform-spanning collection of real-world debugging scenarios with annotated solutions.
📚 This is the learning/reference version. Code contains `BUG:` comments and solution hints for teaching. For the sanitized benchmark version (no hints), see autonomous-software-debugging-benchmarks.
| Document | Purpose |
|---|---|
| CAPABILITIES.md | What "agentic debugging" means and what this suite covers |
| RUN_MODES.md | Environment requirements per project (headless vs IDE) |
| docs/SANITIZATION_GUIDE.md | How to prepare eval-mode copies without answer leakage |
This repository contains real project files with intentional errors designed for learning real-world debugging patterns. Each project:
- Contains realistic, non-trivial code (not toy examples)
- Has errors that mirror real-world failure patterns
- Produces a visible, satisfying result when fixed
- Documents expected behavior without revealing solutions
| Pillar | What It Proves | Where Tools Fail |
|---|---|---|
| Static + Structural | Parsing, AST analysis, syntax repair | Missing file-to-file awareness |
| Runtime Failures | Execution awareness, environment reasoning | Stop at "suggestion" without re-execution |
| Test Failures | Intent reasoning, not just syntax repair | Can't align fixes to test intent |
| Multi-File / Cross-Layer | Agentic reasoning across boundaries | No coordinated multi-file fixes |
| Configuration & Infra | System-level understanding | Hallucinate config solutions |
| Hypothesis-Driven | Proactive reasoning, not reactive | No evidence gathering or confidence scoring |
| Category | Language | Why |
|---|---|---|
| Dynamic | Python | Debugging + AI sweet spot |
| Web | JavaScript / TypeScript | Frontend + backend |
| Systems | Go | Type & compile rigor |
| Enterprise | Java | Real-world expectations |
| Mobile | Kotlin (Android), Swift (iOS) | Build system + platform constraints |
| Game | Unity (C#), C++ | Engine-aware reasoning, asset coordination |
```
vybecoder-capability-suite/
├── python/
│   ├── static_structural/         # Import/export/syntax errors
│   ├── runtime_failure/           # Environment, null refs, type coercion
│   ├── test_failure/              # Failing tests revealing logic flaws
│   ├── multi_file_bug/            # Cross-module contract violations
│   └── hypothesis_debugging/      # Ambiguous symptoms, multiple causes
├── javascript/
│   ├── static_structural/         # Module errors, broken exports
│   ├── runtime_failure/           # Async bugs, undefined access
│   ├── test_failure/              # Jest tests revealing edge cases
│   ├── frontend_backend_mismatch/ # API contract drift
│   └── config_failure/            # Webpack, env, port issues
├── typescript/
│   ├── type_errors/               # Generic constraints, inference failures
│   └── async_failures/            # Promise chains, race conditions
├── java/
│   ├── dependency_issue/          # Maven/Gradle resolution
│   ├── logic_error/               # Off-by-one, state bugs
│   └── test_failure/              # JUnit revealing intent mismatch
├── go/
│   ├── runtime_panic/             # Nil pointer, slice bounds
│   └── concurrency_bug/           # Race conditions, deadlocks
├── kotlin_android/
│   ├── gradle_mismatch/           # Dependency version conflicts
│   ├── lifecycle_crash/           # Fragment/Activity lifecycle misuse
│   └── manifest_error/            # Missing permissions, components
├── swift_ios/
│   ├── optionals_crash/           # Force unwrap failures
│   ├── build_error/               # Missing Info.plist keys
│   └── ui_thread/                 # Main thread violations
├── unity_csharp/
│   ├── lifecycle_bug/             # MonoBehaviour order issues
│   ├── serialization_error/       # Missing SerializeField
│   └── scene_mismatch/            # Asset-code desync
└── cpp_game/
    ├── linker_error/              # Undefined references
    ├── memory_issue/              # Safe memory bugs
    └── header_missing/            # Include path problems
```
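For a sense of what these categories look like in practice, here is a hypothetical sketch (not an actual project from the suite) of a `python/runtime_failure/`-style bug: code that parses cleanly but crashes at runtime because a lookup can return `None`.

```python
# Hypothetical sketch of a runtime_failure-style bug. dict.get() returns
# None for unknown keys, so the unguarded .upper() call raises
# AttributeError at runtime even though the file has no syntax errors.

USERS = {"ada": "Ada Lovelace", "alan": "Alan Turing"}

def display_name(user_id: str) -> str:
    name = USERS.get(user_id)   # returns None for unknown IDs
    return name.upper()         # crashes: None has no .upper()

def display_name_fixed(user_id: str) -> str:
    # Minimal, targeted fix: guard the missing-key case.
    name = USERS.get(user_id)
    if name is None:
        return "<unknown user>"
    return name.upper()
```

A capable debugger should localize the crash to the unguarded `.upper()` call, explain the `None` provenance, and verify the fix by re-executing.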
- Point your debugging system at any project folder
- Observe whether it identifies the root cause
- Observe whether it executes and verifies the fix
- Observe whether it produces the expected result
A debugging system demonstrates capability when it:
- Localizes the error to specific file(s) and line(s)
- Explains why the error occurs (not just what)
- Fixes with minimal, targeted changes
- Verifies by running the code/tests
- Produces the documented expected output
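The "verifies by running the code/tests" criterion can be automated with a small harness. A minimal sketch follows; the command and expected string are illustrative assumptions, not part of the suite:

```python
import subprocess
import sys

def verify_fix(cmd: list[str], expected: str, timeout: int = 60) -> bool:
    """Run a project's documented verification command and check that it
    exits cleanly and prints the expected success marker."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0 and expected in result.stdout

# Illustrative usage: a trivial command standing in for a project's test run.
ok = verify_fix([sys.executable, "-c", "print('all tests passed')"],
                expected="all tests passed")
```

Each project's README supplies the real command and expected output for this kind of check.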
Each project's README describes:
- What's broken (symptoms only)
- Expected behavior when fixed
- How to verify success
No solutions are provided. The debugger must reason independently.
| Rating | Meaning |
|---|---|
| ⭐ | Single file, obvious error |
| ⭐⭐ | Multiple files or subtle bug |
| ⭐⭐⭐ | Cross-layer reasoning required |
| ⭐⭐⭐⭐ | Hypothesis generation needed |
| ⭐⭐⭐⭐⭐ | Platform + toolchain + code coordination |
To add a new test case:
- Create a realistic, minimal project that does something useful
- Introduce a single, realistic failure pattern
- Document symptoms and expected success state
- Create `INSTRUCTOR_NOTES.md` with solution details (excluded from eval)
- Remove any `BUG:` comments before committing (or use `scripts/sanitize.ps1`)
- Tag with difficulty rating and capability pillar
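The actual stripping is done by `scripts/sanitize.ps1`; as a rough illustration of the idea only (not the real script's behavior), an equivalent pass could drop `BUG:`-marked comment lines like so:

```python
import re

# Hypothetical sketch: remove whole-line # or // comments carrying a BUG:
# marker, mirroring the kind of cleanup a sanitization pass performs.
BUG_COMMENT = re.compile(r"^\s*(#|//)\s*BUG:")

def sanitize(source: str) -> str:
    kept = [line for line in source.splitlines()
            if not BUG_COMMENT.match(line)]
    return "\n".join(kept)
```

The real script additionally removes root-cause sections from READMEs and excludes instructor notes, which a line filter alone does not cover.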
For fair benchmarking, use the sanitization script to create an answer-free copy:
```powershell
.\scripts\sanitize.ps1 -SourceDir . -OutputDir ./eval
```

This strips `BUG:` comments, removes root-cause sections from READMEs, and excludes instructor notes.
MIT - Use freely for benchmarking, teaching, or tool evaluation.
This suite is maintained as part of the VybeCoder project but is designed for general use in evaluating any agentic debugging system.