Autonomous Software Debugging Benchmarks

A language- and platform-spanning benchmark suite for evaluating autonomous debugging systems.

⚠️ This is the evaluation version. Code contains no hints or solution markers. For learning/reference with annotated bugs, see real-world-debugging-examples.

Quick Links

Document	Purpose
CAPABILITIES.md	What "agentic debugging" means and what this suite proves
EVALUATION.md	Scoring rubric for judging success
RUN_MODES.md	Environment requirements per project (headless vs IDE)

Purpose

This repository contains realistic project files with deliberately introduced errors, designed to test and calibrate agentic debugging capabilities. Each project:

Contains realistic, non-trivial code (not toy examples)
Has errors that mirror real-world failure patterns
Produces a visible, satisfying result when fixed
Documents expected behavior without revealing solutions

What This Suite Tests

The 6 Capability Pillars

Pillar	What It Proves	Where Tools Fail
Static + Structural	Parsing, AST analysis, syntax repair	Missing file-to-file awareness
Runtime Failures	Execution awareness, environment reasoning	Stop at "suggestion" without re-execution
Test Failures	Intent reasoning, not just syntax repair	Can't align fixes to test intent
Multi-File / Cross-Layer	Agentic reasoning across boundaries	No coordinated multi-file fixes
Configuration & Infra	System-level understanding	Hallucinate config solutions
Hypothesis-Driven	Proactive reasoning, not reactive	No evidence gathering or confidence scoring

Language Coverage

Category	Language	Why
Dynamic	Python	AI/ML ecosystem, debugging sweet spot
Web	JavaScript / TypeScript	Frontend + backend
Systems	Go	Type & compile rigor
Enterprise	Java	Real-world expectations
Mobile	Kotlin (Android), Swift (iOS)	Build system + platform constraints
Game	Unity (C#), C++	Engine-aware reasoning, asset coordination

Repository Structure

autonomous-software-debugging-benchmarks/
├── python/
│   ├── static_structural/      # Import/syntax errors
│   ├── runtime_failure/        # Environment, null refs, type coercion
│   ├── test_failure/           # Failing tests revealing logic flaws
│   ├── multi_file_bug/         # Cross-module contract violations
│   └── hypothesis_debugging/   # Ambiguous symptoms, multiple causes
├── javascript/
│   ├── static_structural/      # Module errors, broken exports
│   ├── runtime_failure/        # Async bugs, undefined access
│   ├── test_failure/           # Jest tests revealing edge cases
│   ├── frontend_backend_mismatch/  # API contract drift
│   └── config_failure/         # Env, port, config issues
├── typescript/
│   ├── type_errors/            # Generic constraints, inference failures
│   └── async_failures/         # Promise chains, race conditions
├── java/
│   ├── dependency_issue/       # Maven resolution
│   ├── logic_error/            # Off-by-one, state bugs
│   └── test_failure/           # JUnit revealing intent mismatch
├── go/
│   ├── runtime_panic/          # Nil pointer, slice bounds
│   └── concurrency_bug/        # Race conditions, deadlocks
├── kotlin/
│   └── android_lifecycle/      # Activity/Fragment lifecycle issues
├── swift/
│   ├── optionals_crash/        # Force unwrap failures
│   ├── build_error/            # Missing Info.plist keys
│   └── ui_thread/              # Main thread violations
├── unity/
│   ├── lifecycle_bug/          # MonoBehaviour order issues
│   ├── serialization_error/    # Missing SerializeField
│   └── scene_mismatch/         # Asset-code desync
├── cpp/
│   ├── linker_error/           # Undefined references
│   ├── memory_issue/           # Safe memory bugs
│   └── header_missing/         # Include path problems
└── cross_domain/
    └── unity_node_contract/    # Multi-language integration
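A harness might enumerate the project folders in this layout along the following lines. This is a minimal sketch that assumes every project sits exactly two levels below the repository root (`language/project`), as in the tree above. The function name `list_projects` and the skip list are hypothetical and not part of the repository.

```python
from pathlib import Path

# Top-level directories that hold documentation or tooling rather than projects.
# This skip list is an assumption based on the tree above, not a repository file.
NON_PROJECT_DIRS = {"docs", "scripts", "eval"}


def list_projects(root: str) -> list[str]:
    """Return 'language/project' paths for every project folder under root."""
    projects = []
    for lang_dir in sorted(Path(root).iterdir()):
        # Skip plain files, hidden directories, and non-project directories.
        if not lang_dir.is_dir() or lang_dir.name.startswith("."):
            continue
        if lang_dir.name in NON_PROJECT_DIRS:
            continue
        for project_dir in sorted(lang_dir.iterdir()):
            if project_dir.is_dir():
                projects.append(f"{lang_dir.name}/{project_dir.name}")
    return projects
```

Sorting both levels keeps the run order stable between evaluations, which makes results easier to compare.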

How to Use This Suite

For Evaluation

Point your debugging system at any project folder
Observe: Does it identify the root cause?
Observe: Does it execute and verify the fix?
Observe: Does it produce the expected result?
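The "execute and verify" observation can be automated by re-running a project's check command after the fix and inspecting the exit status. The sketch below assumes each project has some command that exits with status 0 on success. Which command that is depends on the project, and the helper name `fix_verified` and the default timeout are hypothetical.

```python
import subprocess


def fix_verified(command: list[str], project_dir: str, timeout: float = 300.0) -> bool:
    """Run a project's check command and report whether it exited cleanly."""
    try:
        completed = subprocess.run(
            command,
            cwd=project_dir,       # run from inside the project folder
            capture_output=True,   # keep the output out of the harness log
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hang counts as a failed verification.
        return False
    return completed.returncode == 0
```

A zero exit status alone does not confirm the documented expected output. A fuller harness would also compare `completed.stdout` against what the project's README describes.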

Success Criteria

A debugging system demonstrates capability when it:

Localizes the error to specific file(s) and line(s)
Explains why the error occurs (not just what)
Fixes with minimal, targeted changes
Verifies by running the code/tests
Produces the documented expected output
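The five criteria above could be recorded per run in a small structure like the one below. This is an illustrative sketch only: the class and field names are hypothetical, and the actual scoring rubric is the one defined in EVALUATION.md.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one debugging run; field names are illustrative only."""
    localized: bool        # pointed at the right file(s) and line(s)
    explained: bool        # explained why the error occurs, not just what
    minimal_fix: bool      # changed only what was needed
    verified: bool         # re-ran the code or tests after the fix
    expected_output: bool  # produced the documented expected output


def criteria_met(result: RunResult) -> int:
    """Count how many of the five criteria a run satisfied."""
    return sum([
        result.localized,
        result.explained,
        result.minimal_fix,
        result.verified,
        result.expected_output,
    ])


def demonstrates_capability(result: RunResult) -> bool:
    """Treat a run as a success only when all five criteria hold."""
    return criteria_met(result) == 5
```

Keeping the criteria as separate fields makes partial results visible, for example a system that fixes the bug but never re-runs the tests.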

What Success Looks Like

Each project's README describes:

What's broken (symptoms only)
Expected behavior when fixed
How to verify success

No solutions are provided. The debugger must reason independently.

Difficulty Ratings

Rating	Meaning	Example Projects
⭐	Single file, obvious error	—
⭐⭐	Normal bugs developers hit daily	`python/test_failure`, `java/test_failure`, `javascript/test_failure`, `typescript/type_errors`
⭐⭐⭐	Cross-layer reasoning required	`python/runtime_failure`, `javascript/runtime_failure`
⭐⭐⭐⭐	Hypothesis generation needed	`python/multi_file_bug`, `go/concurrency_bug`, `typescript/async_failures`
⭐⭐⭐⭐⭐	Platform + toolchain + code	`cross_domain/unity_node_contract`, `cpp/header_missing`

Difficulty Distribution

⭐          ░░░░░░░░░░░░░░░░░░░░  0%   (baseline, not included)
⭐⭐        ███████░░░░░░░░░░░░░  35%  (credibility anchors)
⭐⭐⭐      █████░░░░░░░░░░░░░░░  25%  (intermediate)
⭐⭐⭐⭐    █████░░░░░░░░░░░░░░░  25%  (advanced)
⭐⭐⭐⭐⭐  ███░░░░░░░░░░░░░░░░░  15%  (expert)
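As a rough sketch, a 20-cell bar can be derived from each percentage with simple proportional rounding. The percentages are copied from the chart; the function names are hypothetical and this is not how the repository generates its chart.

```python
# Percentages copied from the difficulty distribution in this README,
# keyed by star rating.
DISTRIBUTION = {1: 0, 2: 35, 3: 25, 4: 25, 5: 15}


def render_bar(percent: float, width: int = 20) -> str:
    """Render a percentage as a bar of `width` cells, filled proportionally."""
    filled = round(percent * width / 100)
    return "█" * filled + "░" * (width - filled)


def render_distribution(distribution: dict[int, int]) -> list[str]:
    """Build one text line per star rating, lowest rating first."""
    lines = []
    for stars, percent in sorted(distribution.items()):
        lines.append(f"{'⭐' * stars:<5} {render_bar(percent)} {percent:>3}%")
    return lines
```

With a width of 20, each cell stands for 5 percentage points, so 35% fills 7 cells, 25% fills 5, and 15% fills 3.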

Contributing

To add a new test case:

Create a realistic, minimal project that does something useful
Introduce a single, realistic failure pattern
Document symptoms and expected success state
Create INSTRUCTOR_NOTES.md with solution details (excluded from eval)
Tag with difficulty rating and capability pillar
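Before submitting a new test case, a contributor could check that the documentation files from the steps above are present. The sketch below assumes the per-project symptom description lives in a file named `README.md` next to `INSTRUCTOR_NOTES.md`. That file name, the constant, and the helper `missing_files` are hypothetical.

```python
from pathlib import Path

# Files a new test case is expected to carry before sanitization.
# README.md as the name of the symptom description is an assumption.
REQUIRED_FILES = ("README.md", "INSTRUCTOR_NOTES.md")


def missing_files(project_dir: str) -> list[str]:
    """Return the required files that are absent from a project folder."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty result means both files exist. It says nothing about whether the README avoids revealing the solution, which still needs a human review.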

Evaluation Mode

For fair benchmarking, use the PowerShell sanitization script to create an answer-free copy:

.\scripts\sanitize.ps1 -SourceDir . -OutputDir ./eval
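For readers without PowerShell, the core idea of an answer-free copy can be sketched in Python. This is not a port of `sanitize.ps1`: it only assumes that `INSTRUCTOR_NOTES.md` files must be left out, and whatever else the real script strips is not covered here. The function name `make_eval_copy` is hypothetical.

```python
import shutil


def make_eval_copy(source_dir: str, output_dir: str) -> None:
    """Copy the suite to output_dir, leaving out instructor notes.

    Assumption: solution details live only in INSTRUCTOR_NOTES.md files.
    output_dir must not exist yet and should sit outside source_dir.
    """
    shutil.copytree(
        source_dir,
        output_dir,
        # Skip solution notes and version-control metadata at every level.
        ignore=shutil.ignore_patterns("INSTRUCTOR_NOTES.md", ".git"),
    )
```

Writing the copy to a directory outside the source tree avoids any chance of the copy including itself.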

License

MIT - Use freely for benchmarking, teaching, or tool evaluation.

This suite is designed for general use in evaluating any autonomous debugging system.
