Haichao Zhu*, Qian Zhang*, Jiyuan Wang, Zhaorui Yang, and Yuxin Qiu (* indicates equal contribution)
| arXiv | Project Page |
TL;DR: Needle in the Repo (NITR) is a C++ repository-level benchmark for evaluating whether AI-generated repository edits preserve maintainable structure, not just behavioral correctness. It comprises 22 curated C++ repository probes across nine maintainability dimensions, pairing natural multi-file change requests with hidden functional tests and structural oracles. The benchmark is designed to expose cases where an agent produces behaviorally correct code that still introduces maintainability failures such as weak modularity, poor testability, or architectural shortcutting.
Figure: Pass/fail heatmap of 23 evaluated configurations across 21 cases from the paper.
This repository contains the public benchmark release:
- 22 starter cases under `cases/`
- case specifications and design docs under `docs/`
- public evaluator code under `evaluator/`
- agent-facing task statements (`TASK.md`, `TASK1.md`, ...)
- vendored dependencies required to build cases and evaluators
This repository includes local submission helpers under submit/ for
benchmark automation, but it does not include a hosted submission service.
```
.
|-- cases/
|-- docs/
|-- evaluator/
|-- third_party/
|   |-- googletest/
|   |-- eigen3/
|   `-- json/
|-- CMakeLists.txt
`-- .gitignore
```
Each case directory contains starter source code plus one or more task files.
Multi-step cases use TASK1.md, TASK2.md, and TASK3.md. Single-step cases
use TASK.md. The docs/ directory contains case specifications and
supporting benchmark design materials. The evaluator/ directory contains the
public checks, tests, and fixtures used by this repository.
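The single-step vs. multi-step layout above can be detected mechanically. A minimal sketch (the filenames in the example listings are hypothetical; the real case directories are authoritative):

```python
import re

def classify_case(filenames):
    """Classify a case as single-step or multi-step from its task files:
    single-step cases ship one TASK.md, multi-step cases ship
    TASK1.md, TASK2.md, ... (sketch only)."""
    steps = sorted(f for f in filenames if re.fullmatch(r"TASK\d+\.md", f))
    if steps:
        return ("multi-step", steps)
    if "TASK.md" in filenames:
        return ("single-step", ["TASK.md"])
    raise ValueError("no task files found")

# Hypothetical directory listings:
print(classify_case(["main.cpp", "TASK.md"]))
print(classify_case(["lib.cpp", "TASK2.md", "TASK1.md", "TASK3.md"]))
```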
If you want to add a new NITR case to this repository, see
HOW_TO_CREATE_CASE.md. It describes how to choose a
design dimension, write SPEC.md and TASK.md, add starter code under
cases/, add evaluator tests and structural checks under evaluator/, and
prepare the pull request to merge the new case into main.
Local submission helpers live under submit/ and are documented in submit/README.md.
They can run cases against supported model backends, materialize generated case
directories, and evaluate them locally with the public NITR evaluator.
For an end-to-end usage guide, including single-case submit, batch submit, and
local evaluation, see HOW_TO_SUBMIT.md.
Python package setup for the submit tooling is documented there as well.
Configure all cases:

```shell
cmake -S . -B build
```

Configure one case plus its evaluator:

```shell
cmake -S . -B build \
  -DNITR_BUILD_ALL_CASES=OFF \
  -DNITR_CASE=003.reuse-existing-code \
  -DNITR_BUILD_EVALUATOR=ON
```

Configure one case only:

```shell
cmake -S . -B build -DNITR_BUILD_ALL_CASES=OFF -DNITR_CASE=001.add-no-callsite-spread
```

Some starter cases are intentionally incomplete and may not compile before the task is solved. The public corpus is meant to expose the starter state, not a fully solved build.
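If you script these configure variants, a small helper can assemble the command line. A sketch using only the cache variables shown above (`NITR_BUILD_ALL_CASES`, `NITR_CASE`, `NITR_BUILD_EVALUATOR`); the plain cmake commands remain the reference:

```python
def cmake_configure_args(case=None, with_evaluator=False,
                         source=".", build="build"):
    """Build the cmake configure command: no case means configure all
    cases; a case slug configures only that case, optionally with its
    evaluator. (Helper sketch, not part of the repo tooling.)"""
    args = ["cmake", "-S", source, "-B", build]
    if case is not None:
        args += ["-DNITR_BUILD_ALL_CASES=OFF", f"-DNITR_CASE={case}"]
        if with_evaluator:
            args.append("-DNITR_BUILD_EVALUATOR=ON")
    return args

print(" ".join(cmake_configure_args()))
print(" ".join(cmake_configure_args("003.reuse-existing-code",
                                    with_evaluator=True)))
```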
For local development, manual debugging, or quick validation of one case, you can configure and build a single case with:

```shell
python3 tools/run_case.py 002.refactor-and-resue
```

To build a single case and run its public evaluator checks locally:

```shell
python3 tools/run_case.py 002.refactor-and-resue --with-evaluator
```

Use tools/run_case.py when you want to work on one case directly inside the
repository. Use the tooling under submit/ when you want to
materialize generated submission outputs under .submit-output/ and evaluate
those results separately.
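Running several cases back to back is a short loop over the same command. A sketch that assumes a local checkout of this repository (the `run_cases` driver is hypothetical; only the `tools/run_case.py` invocation comes from the docs above):

```python
import subprocess

def run_case_cmd(case_slug, with_evaluator=False):
    """Command line for tools/run_case.py, as shown above."""
    cmd = ["python3", "tools/run_case.py", case_slug]
    if with_evaluator:
        cmd.append("--with-evaluator")
    return cmd

def run_cases(case_slugs, with_evaluator=True):
    """Run each case in turn from the repository root and collect
    return codes (0 = the public checks passed)."""
    results = {}
    for slug in case_slugs:
        proc = subprocess.run(run_case_cmd(slug, with_evaluator))
        results[slug] = proc.returncode
    return results
```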
This public repository does not include a hosted submission service. In the
open release, a "submission" means producing repository edits for a selected
case and then evaluating the result locally with the provided public evaluator
or the helper scripts under submit/.
Typical workflow:
```shell
# 1. Pick a case and read its task file(s).
# 2. Ask your coding system to edit files under cases/<case_slug>/.
# 3. Run the public evaluator locally.
python3 tools/run_case.py 002.refactor-and-resue
python3 tools/run_case.py 002.refactor-and-resue --with-evaluator
```

You can use the benchmark through several interfaces:
- API workflow: send the selected case directory and its `TASK.md` or `TASK1.md`/`TASK2.md`/`TASK3.md` files to your coding agent through an API, apply the returned file edits inside this repository, and then run `python3 tools/run_case.py <case_slug> --with-evaluator`.
- Agent CLI workflow: open this repository in an agentic coding tool, point the agent at a specific case, ask it to complete the requested edits in place, and then run the same local evaluator command.
- Web chat workflow: if you use a browser-based coding assistant, provide the relevant case files and task statement, copy the generated edits back into the repository, and then run the local evaluator command here.
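For the API workflow, the packaging step can be as simple as concatenating the task statement and the starter files into one prompt. A hedged sketch: the function, prompt layout, and miniature case below are illustrative only, and the actual API call to your agent is out of scope here:

```python
def build_edit_request(task_text, files):
    """Assemble one prompt from a task statement plus starter files,
    each fenced and labeled by relative path. `files` maps paths to
    contents. (Any layout your agent accepts works equally well.)"""
    parts = ["# Task\n" + task_text.strip(), "# Repository files"]
    for path in sorted(files):
        parts.append(f"## {path}\n```cpp\n{files[path].rstrip()}\n```")
    return "\n\n".join(parts)

# Hypothetical miniature case:
prompt = build_edit_request(
    "Refactor the duplicated parsing logic into one helper.",
    {"cases/demo/main.cpp": "int main() { return 0; }\n"},
)
print(prompt)
```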
For multi-step cases, apply the task files in order. That is, complete
TASK1.md first, continue from the resulting code state to TASK2.md, and so
on before running the evaluator.
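The in-order rule above amounts to a fold over the task files. A sketch with a stubbed agent callback (`apply_step` stands in for whatever coding system you drive; only the ordering requirement comes from the docs):

```python
def apply_multi_step(task_files, apply_step):
    """Apply multi-step task files strictly in order, threading the
    code state from one step into the next. Run the evaluator only
    after the final step. (Sketch, not repo tooling.)"""
    state = "starter"
    for task in sorted(task_files):  # TASK1.md before TASK2.md, ...
        state = apply_step(task, state)
    return state

# Stub agent that just records the order of application:
trail = []
apply_multi_step(
    ["TASK3.md", "TASK1.md", "TASK2.md"],
    lambda task, state: trail.append(task) or f"{state}+{task}",
)
print(trail)
```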
- 001.add-no-callsite-spread
- 002.refactor-and-resue
- 003.reuse-existing-code
- 004.cv-srp
- 005.pricing-ocp
- 006.gs-isp
- 007.ml-lsp-multistep
- 008.map-dip
- 009.session-expiry-testability
- 010.logging-side-effects
- 011.config-sprawl
- 012.cache-lifecycle
- 013.stable-public-api
- 014.report-export-ocp
- 015.pipeline-provider-decoupling
- 016.device-segment-planner
- 017.active-snapshot-lifecycle
- 018.seeded-selection-testability
- 019.ranking-explainability-boundary
- 020.handover-packet-ownership-boundary
- 021.inline-filter-entrypoint-reuse
- 022.thermostat-sensor-decoupling
If you use NITR in your research, please cite the following paper:
```bibtex
@misc{zhu2026nitr,
  title         = {Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits},
  author        = {Haichao Zhu and Qian Zhang and Jiyuan Wang and Zhaorui Yang and Yuxin Qiu},
  year          = {2026},
  eprint        = {2603.27745},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2603.27745}
}
```

For GitHub citation metadata, see CITATION.cff.
If you want to contribute cases, evaluator updates, or tooling changes, see CONTRIBUTING.md.


