A multi-agent AI-text-detection framework with cross-examination debate.
Busted wraps published AI-generated-text (AIGT) detectors — DivEye, BENADV, DeTeCtive, DetectGPT, LLM-as-judge — inside a structured debate protocol (cross-examination, novelty gates, anti-groupthink steelman, dissent quotas). A moderator agent aggregates verdicts via skill-proportional weighted voting and issues a final ruling with consensus zones, irreducible tensions, and a minority report.
⚠️ This is an engineering case study, not a SOTA detector. See Honest Benchmark Comparison below. The single best detector (DivEye, 97.7% on HC3) outperforms our 4-agent ensemble (81.2% on a 32-text validation set). The value of this repo is in the negative results, the methodology, and the reproducible engineering log.
- A working multi-agent debate framework for AIGT detection, with FastAPI backend, WebSocket streaming, and a D3 graph-viz frontend.
- A reproducible subset benchmark showing that a 4-detector ensemble beats a 7-detector ensemble (81.2% vs 78.1%) on a held-out set — evidence that "more agents ≠ better" for cross-exam protocols.
- A Cohen's d feature-gate methodology that falsified four candidate detectors (MFD, BENATTEN, fractals, "semantic curvature") before paying their integration cost.
- Documented negative results — the kind that papers don't usually publish, but that save downstream researchers weeks of work.
```
┌──────────────┐      ┌─────────────────────────────────────────┐
│ POST /api/   │────▶│ TEXT_INPUT node (TemporalKnowledgeGraph) │
│ analyze      │      └────────────────────┬────────────────────┘
└──────────────┘                           │ EventBus broadcast
                                           ▼
             ┌──────────────┬──────────────┬──────────────┬──────────────┐
             ▼              ▼              ▼              ▼              ▼
        DivEyeAgent    BENADVAgent    LLMJudgeAgent  DetectiveAgent   (others
        (XGBoost on    (RandomForest  (NIM Nemotron  (SimCSE+FAISS    disabled)
        surprisal      on multi-      reasoning)     KNN)
        stats)         encoder
                       Benford)
             │              │              │              │
             └──────────────┴──────┬───────┴──────────────┘
                                   ▼
                  ┌─────────────────────────────────┐
                  │ Phase 1: blind-first verdicts   │
                  │ Phase 2: groupthink / dissent   │
                  │ Phase 3: cross-examination      │
                  │          (max 2 rounds, with    │
                  │          novelty gate &         │
                  │          PROTECTED-PAIR rule)   │
                  │ Phase 4: weighted aggregation   │
                  │          + FINAL_RULING node    │
                  └─────────────┬───────────────────┘
                                │
                                ▼
                   WebSocket stream → frontend
               (D3 dagre graph + per-agent cards)
```
The 4 active agents (diveye, benadv, llm_judge, detective) are
configurable via the BUSTED_DETECTORS environment variable; the disabled
ones (statistical, stylometric, logprob, plus archived mfd, benatten,
zitnh, roberta_detector) remain in the codebase for further experimentation.
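The selection mechanism can be sketched as follows — a hypothetical reading of how `config.py` might honor `BUSTED_DETECTORS`, not the actual implementation (function and constant names here are illustrative):

```python
import os

# Default 4-agent winning ensemble, per the benchmark below.
DEFAULT_DETECTORS = ["diveye", "benadv", "llm_judge", "detective"]

def active_detectors(env=None):
    """Parse the comma-separated BUSTED_DETECTORS variable,
    falling back to the default ensemble when unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("BUSTED_DETECTORS", "")
    names = [n.strip() for n in raw.split(",") if n.strip()]
    return names or DEFAULT_DETECTORS
```

For example, `BUSTED_DETECTORS=diveye,benadv` would yield a 2-agent run, matching the "Custom subset" invocation in the quickstart.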
Each row was run end-to-end (server restart, full validation pipeline)
against the same 32-text validation set (20 base + 12 adversarial register-
flipped). The validation set is defined in `tests/validation_set.py`.
| Subset | N | Overall | Base | Adversarial | Time/text |
|---|---|---|---|---|---|
| solo_diveye | 1 | 65.6 % | 70 % | 58.3 % | 4.7 s |
| diveye + benadv | 2 | 65.6 % | 70 % | 58.3 % | 12 s |
| diveye + benadv + llm_judge | 3 | 65.6 % | 75 % | 50 % | 16 s |
| diveye + benadv + llm_judge + detective | 4 | 81.2 % | 95 % | 58.3 % | 15 s |
| Full 7-detector ensemble | 7 | 78.1 % | 95 % | 50 % | 22 s |
Per-text wall-clock includes debate, LLM rounds, and graph commits.
Why 1–3 agents collapse to DivEye-solo accuracy: weights were calibrated
to per-detector accuracy (DivEye = 4.5, BENADV = 2.5, others ≤ 1.7). With
only 1–3 agents, DivEye dominates the vote; cross-exam needs at least one
counterweight set to function.
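The collapse falls out of the arithmetic. Using the weights quoted above (the vote rule itself is a simplified sketch, not the actual moderator code):

```python
# Weights from the calibration described above; binary verdicts for clarity.
WEIGHTS = {"diveye": 4.5, "benadv": 2.5, "llm_judge": 1.7, "detective": 1.7}

def weighted_verdict(verdicts):
    """Skill-proportional weighted vote over 'ai' / 'human' verdicts."""
    tally = {"ai": 0.0, "human": 0.0}
    for agent, label in verdicts.items():
        tally[label] += WEIGHTS[agent]
    return max(tally, key=tally.get)

# 3-agent subset: DivEye (4.5) outweighs BENADV + LLM judge (2.5 + 1.7 = 4.2),
# so the ensemble reduces to DivEye-solo.
three = weighted_verdict({"diveye": "ai", "benadv": "human", "llm_judge": "human"})

# 4-agent subset: three counterweights (2.5 + 1.7 + 1.7 = 5.9) can overrule
# DivEye (4.5), so cross-exam has teeth.
four = weighted_verdict({"diveye": "ai", "benadv": "human",
                         "llm_judge": "human", "detective": "human"})
```

Here `three` comes back as DivEye's verdict while `four` flips to the majority, which is exactly the 1–3-agent plateau in the table.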
Why the 7th agent destabilizes the system: documented across six candidates (MFD, BENATTEN, ZiTNH, RoBERTa-Hello-SimpleAI, DivEye when added to a stable 6-set, and "semantic curvature"). Each one degraded adversarial recall by 8–25 pp. We hypothesise that the cross-exam protocol has a structural ceiling on ensemble size.
Reproduce: `python tests/subset_benchmark.py`
Every candidate feature is evaluated on 3000 HC3 samples before any classifier training. We require |Cohen's d| ≥ 0.5 on at least one feature plus an ablation against length / register confounds.
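The gate itself is standard statistics; a minimal stdlib sketch (the real pipeline additionally runs the length/register ablations):

```python
import math

def cohens_d(xs, ys):
    """Cohen's d with pooled standard deviation (sample variance, ddof=1)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    pooled = math.sqrt(((len(xs) - 1) * vx + (len(ys) - 1) * vy)
                       / (len(xs) + len(ys) - 2))
    return (mx - my) / pooled

def passes_gate(d, threshold=0.5):
    """The |d| >= 0.5 feature gate applied before any classifier training."""
    return abs(d) >= threshold
```

Applied to the table below: fractals (0.42) fail the gate outright, while semantic curvature (1.44) passes it and only dies at the ablation stage.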
Falsified candidates (saved for posterity in docs/negative_results.md):
| Candidate | Best \|d\| | Verdict |
|---|---|---|
| Fractals (Hurst) | 0.42 | Below gate |
| MFD register-inv. | ~0.2 | Below gate |
| BENATTEN aggregate | ~0.5 | Centroid CV 68 % (below ensemble) |
| Semantic curvature | 1.44 | Failed length-confound ablation |
Passed candidates went into the active ensemble: BENADV (|d|=1.1), DivEye
(|d|=2.84 on mean_surprisal).
Weights follow `weight = 5 × max(0.05, accuracy − 0.5)`, computed from
per-detector HC3 cross-validation. The mapping pushes accurate detectors to
roughly 4–5× the weight of mediocre ones, which is necessary once you have a
strong standalone detector like DivEye, but it also creates the
small-ensemble collapse documented above.
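The mapping in one function, with illustrative (not measured) accuracies to show the spread and the 0.05 floor for near-chance detectors:

```python
def detector_weight(accuracy):
    """weight = 5 * max(0.05, accuracy - 0.5), as documented above."""
    return 5 * max(0.05, accuracy - 0.5)

# Hypothetical accuracies, purely to illustrate the ratio:
strong = detector_weight(0.95)    # ~2.25
mediocre = detector_weight(0.60)  # ~0.5  -> strong is ~4.5x mediocre
floored = detector_weight(0.40)   # 0.25 (floor kicks in below 51% accuracy)
```

Note the floor: any detector at or below chance still gets weight 0.25 rather than zero, so it can never be silently dropped from the vote.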
Inspired by the Council of High Intelligence skill, adapted to AIGT detection:
- Phase 1 — Blind-first: all detectors emit a `DETECTION_VERDICT` node without seeing each other's output (prevents anchoring bias).
- Phase 2 — Disagreement detection: the moderator detects polarity pairs and applies anti-groupthink (≥ 70 % agreement → forced steelman of the opposite position).
- Phase 3 — Cross-exam (max 2 rounds): each dissenter must respond to ≥ 1 piece of specific opposing evidence and introduce ≥ 1 new claim (novelty gate). A strong-pair rule (`PROTECTED_PAIR`) prevents the two highest-weight detectors from being flipped by majority pressure when they are in the minority.
- Phase 4 — Weighted aggregation with explicit `consensus_zones`, `irreducible_tensions`, and `minority_report` in the final ruling.
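The Phase-3 novelty gate reduces to a small predicate. A hypothetical sketch — field names and the claim representation are illustrative, not the real node schema:

```python
def passes_novelty_gate(rebuttal_claims, addressed_evidence, record):
    """A dissenter's rebuttal is admitted only if it (a) answers at least
    one specific piece of opposing evidence and (b) introduces at least
    one claim not already on the debate record."""
    answers_opposition = len(addressed_evidence) >= 1
    novel_claims = set(rebuttal_claims) - set(record)
    return answers_opposition and len(novel_claims) >= 1
```

A rebuttal that merely restates "surprisal looks flat" against an unchanged record is rejected, which is what keeps the two cross-exam rounds from looping.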
We do not claim state of the art. The table is what it is:
| System | HC3 accuracy | Adversarial | Notes |
|---|---|---|---|
| DivEye standalone (paper) | 97.7 % | n/a | Strongest single detector |
| Binoculars (2024) | ~92 % | ~80 % | Cross-perplexity ratio |
| DeTeCtive (NeurIPS 2024) | ~91 % | ~75 % | Multi-level contrastive |
| RADAR (2023) | ~88 % | ~85 % | Adversarial-trained |
| Busted 4-agent ensemble | 81.2 % | 58.3 % | This repo (32-text test set) |
| GPTZero / Originality (commercial) | 75–85 % | 60–70 % | Industry baseline |
Caveats: our validation set is 32 texts; HC3 has 24,000+. Numbers in the "Adversarial" column use register-flipped prompts (formal-AI / casual-human) and are not directly comparable across papers. Treat the table as order-of-magnitude orientation.
- Python 3.11+
- CUDA 12.4 GPU recommended (CPU works but is 10–30× slower)
- NVIDIA NIM API key (free tier sufficient for testing) — or a local Ollama server with `llama3.2`
```bash
git clone https://github.com/MorkMindy74/Busted.git
cd Busted
pip install -r requirements.txt

# Set your NIM key
cp .env.example .env
# edit .env and set NVIDIA_API_KEY

# Pull third-party deps (DeTeCtive, diveye)
# See VENDOR.md for instructions

# Build HC3 subset (only if you want to retrain classifiers)
python tests/build_hc3_subset.py
```

```bash
# Default: 4-agent winning ensemble
uvicorn backend.main:app --host 127.0.0.1 --port 8765 --reload

# Custom subset
BUSTED_DETECTORS=diveye,benadv uvicorn backend.main:app --port 8765
```

Open http://127.0.0.1:8765 in your browser. Paste text, hit Analizza, watch the debate unfold in real time.

```bash
python tests/subset_benchmark.py
# results -> tests/subset_benchmarks/summary.json
```

Allow ~3 hours for the full 5-subset sweep on an RTX 2050.
```
Busted/
├── backend/           # FastAPI app, agents, detectors, event bus, KG
│   ├── agents/        # one wrapper per detector + moderator
│   ├── detectors/     # pure detection logic (no framework)
│   ├── events/        # pub/sub EventBus
│   ├── graph/         # TemporalKnowledgeGraph + schema
│   ├── llm/           # NIMScheduler with fallback pool
│   └── routes/        # /api/analyze, /api/graph, WebSocket
├── frontend/          # vanilla HTML/JS + D3 graph viz
├── detective_models/  # our trained classifiers (joblib)
├── docs/              # research log, methodology, negative results
├── tests/             # benchmark + extraction + training scripts
├── tasks/             # internal todo + lessons (development log)
├── config.py          # detector registry, weights, model pools
├── requirements.txt
├── LICENSE            # MIT (original code)
├── NOTICE.md          # third-party attribution
└── VENDOR.md          # how to fetch third-party deps
```
- `vendor/DeTeCtive` — upstream has no `LICENSE` file at time of writing. Clone yourself; see `VENDOR.md`.
- `vendor/diveye` — CC BY-NC-SA 4.0 from IBM. Clone yourself.
- `detective_models/M4_monolingual_best.pth` (~476 MB) — third-party DeTeCtive checkpoint; download from HuggingFace.
- `hc3_data/` — Hello-SimpleAI HC3 raw dump. Build with `tests/build_hc3_subset.py`.
- API keys — `.env` is git-ignored. Use `.env.example` as a template.
Issues and PRs welcome. Ideas particularly worth exploring:
- A 5th detector that doesn't trip the Nth-agent destabilization (theory: it must contribute orthogonal evidence — Cohen's d on independent features, not just any signal).
- Evaluation on a larger held-out set (HC3 test split, RAID, M4-monolingual test).
- A "fast" mode without the LLM judge (would cut latency from ~15 s to ~3 s per text).
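One cheap screen for the "orthogonal evidence" idea above is the correlation between a candidate feature and the existing detectors' scores over the same texts: a near-zero Pearson r suggests genuinely new information, while |r| close to 1 predicts another destabilizing redundant agent. A stdlib sketch:

```python
def pearson(xs, ys):
    """Plain Pearson correlation between a candidate feature and an
    existing detector's scores; a cheap orthogonality screen."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

This is a screen, not a proof — two features can be uncorrelated and still carry no extra class signal, so the Cohen's d gate still applies afterwards.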
Please run pytest and the subset benchmark before submitting changes that
touch detector logic.
If you use this work, please cite the underlying detectors (the real science) in addition to this repo:
```bibtex
@misc{busted2026,
  title  = {Busted: A multi-agent AIGT detection framework with cross-examination debate},
  author = {Rossi, Marco},
  year   = {2026},
  url    = {https://github.com/MorkMindy74/Busted}
}
```

For the wrapped methods, cite their original papers (DivEye, DeTeCtive,
DetectGPT, etc.) — see NOTICE.md.
MIT for the original code in this repo. Third-party components
retain their respective licenses; see NOTICE.md.