A multi-agent AI-text-detection framework with cross-examination debate.
Busted wraps published AI-generated-text (AIGT) detectors — DivEye, BENADV, DeTeCtive, DetectGPT, LLM-as-judge — inside a structured debate protocol (cross-examination, novelty gates, anti-groupthink steelman, dissent quotas). A moderator agent aggregates verdicts via skill-proportional weighted voting and issues a final ruling with consensus zones, irreducible tensions, and a minority report.
⚠️ This is an engineering case study, not a SOTA detector. See Honest Benchmark Comparison below. The single best detector (DivEye, 97.7% on HC3) outperforms our 4-agent ensemble (81.2% on a 32-text validation set). The value of this repo is in the negative results, the methodology, and the reproducible engineering log.
- A working multi-agent debate framework for AIGT detection, with FastAPI backend, WebSocket streaming, and a D3 graph-viz frontend.
- A reproducible subset benchmark showing that a 4-detector ensemble beats a 7-detector ensemble (81.2% vs 78.1%) on a held-out set — evidence that "more agents ≠ better" for cross-exam protocols.
- A Cohen's d feature-gate methodology that falsified four candidate detectors (MFD, BENATTEN, fractals, "semantic curvature") before paying their integration cost.
- Documented negative results — the kind that papers don't usually publish, but that save downstream researchers weeks of work.
```
┌──────────────┐      ┌─────────────────────────────────────────┐
│ POST /api/   │────▶│ TEXT_INPUT node (TemporalKnowledgeGraph) │
│ analyze      │      └────────────────────┬────────────────────┘
└──────────────┘                           │ EventBus broadcast
                                           ▼
             ┌──────────────┬──────────────┬──────────────┬──────────────┐
             ▼              ▼              ▼              ▼              ▼
        DivEyeAgent    BENADVAgent    LLMJudgeAgent  DetectiveAgent   (others
        (XGBoost on    (RandomForest  (NIM Nemotron  (SimCSE+FAISS    disabled)
        surprisal      on multi-      reasoning)     KNN)
        stats)         encoder
                       Benford)
             │              │              │              │
             └──────────────┴──────┬───────┴──────────────┘
                                   ▼
                  ┌─────────────────────────────────┐
                  │ Phase 1: blind-first verdicts   │
                  │ Phase 2: groupthink / dissent   │
                  │ Phase 3: cross-examination      │
                  │          (max 2 rounds, with    │
                  │          novelty gate &         │
                  │          PROTECTED-PAIR rule)   │
                  │ Phase 4: weighted aggregation   │
                  │          + FINAL_RULING node    │
                  └─────────────┬───────────────────┘
                                │
                                ▼
                   WebSocket stream → frontend
               (D3 dagre graph + per-agent cards)
```
The 4 active agents (diveye, benadv, llm_judge, detective) are
configurable via the BUSTED_DETECTORS environment variable; the disabled
ones (statistical, stylometric, logprob, plus archived mfd, benatten,
zitnh, roberta_detector) remain in the codebase for further experimentation.
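The selection mechanism can be sketched as follows — a hypothetical reading of how `config.py` might honor `BUSTED_DETECTORS`, not the actual implementation (function and constant names here are illustrative):

```python
import os

# Default 4-agent winning ensemble, per the benchmark below.
DEFAULT_DETECTORS = ["diveye", "benadv", "llm_judge", "detective"]

def active_detectors(env=None):
    """Parse the comma-separated BUSTED_DETECTORS variable,
    falling back to the default ensemble when unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("BUSTED_DETECTORS", "")
    names = [n.strip() for n in raw.split(",") if n.strip()]
    return names or DEFAULT_DETECTORS
```

For example, `BUSTED_DETECTORS=diveye,benadv` would yield a 2-agent run, matching the "Custom subset" invocation in the quickstart.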
Each row was run end-to-end (server restart, full validation pipeline)
against the same 32-text validation set (20 base + 12 adversarial register-
flipped). The validation set is defined in `tests/validation_set.py`.
| Subset | N | Overall | Base | Adversarial | Time/text |
|---|---|---|---|---|---|
| solo_diveye | 1 | 65.6 % | 70 % | 58.3 % | 4.7 s |
| diveye + benadv | 2 | 65.6 % | 70 % | 58.3 % | 12 s |
| diveye + benadv + llm_judge | 3 | 65.6 % | 75 % | 50 % | 16 s |
| diveye + benadv + llm_judge + detective | 4 | 81.2 % | 95 % | 58.3 % | 15 s |
| Full 7-detector ensemble | 7 | 78.1 % | 95 % | 50 % | 22 s |
Per-text wall-clock includes debate, LLM rounds, and graph commits.
Why 1–3 agents collapse to DivEye-solo accuracy: weights were calibrated
to per-detector accuracy (DivEye = 4.5, BENADV = 2.5, others ≤ 1.7). With
only 1–3 agents, DivEye dominates the vote; cross-exam needs at least one
counterweight set to function.
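The collapse falls out of the arithmetic. Using the weights quoted above (the vote rule itself is a simplified sketch, not the actual moderator code):

```python
# Weights from the calibration described above; binary verdicts for clarity.
WEIGHTS = {"diveye": 4.5, "benadv": 2.5, "llm_judge": 1.7, "detective": 1.7}

def weighted_verdict(verdicts):
    """Skill-proportional weighted vote over 'ai' / 'human' verdicts."""
    tally = {"ai": 0.0, "human": 0.0}
    for agent, label in verdicts.items():
        tally[label] += WEIGHTS[agent]
    return max(tally, key=tally.get)

# 3-agent subset: DivEye (4.5) outweighs BENADV + LLM judge (2.5 + 1.7 = 4.2),
# so the ensemble reduces to DivEye-solo.
three = weighted_verdict({"diveye": "ai", "benadv": "human", "llm_judge": "human"})

# 4-agent subset: three counterweights (2.5 + 1.7 + 1.7 = 5.9) can overrule
# DivEye (4.5), so cross-exam has teeth.
four = weighted_verdict({"diveye": "ai", "benadv": "human",
                         "llm_judge": "human", "detective": "human"})
```

Here `three` comes back as DivEye's verdict while `four` flips to the majority, which is exactly the 1–3-agent plateau in the table.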
Why the 7th agent destabilizes the system: documented across six candidates (MFD, BENATTEN, ZiTNH, RoBERTa-Hello-SimpleAI, DivEye when added to a stable 6-set, and "semantic curvature"). Each one degraded adversarial recall by 8–25 pp. We hypothesise that the cross-exam protocol has a structural ceiling on ensemble size.
Reproduce: `python tests/subset_benchmark.py`
Every candidate feature is evaluated on 3000 HC3 samples before any classifier training. We require |Cohen's d| ≥ 0.5 on at least one feature plus an ablation against length / register confounds.
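The gate itself is standard statistics; a minimal stdlib sketch (the real pipeline additionally runs the length/register ablations):

```python
import math

def cohens_d(xs, ys):
    """Cohen's d with pooled standard deviation (sample variance, ddof=1)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    pooled = math.sqrt(((len(xs) - 1) * vx + (len(ys) - 1) * vy)
                       / (len(xs) + len(ys) - 2))
    return (mx - my) / pooled

def passes_gate(d, threshold=0.5):
    """The |d| >= 0.5 feature gate applied before any classifier training."""
    return abs(d) >= threshold
```

Applied to the table below: fractals (0.42) fail the gate outright, while semantic curvature (1.44) passes it and only dies at the ablation stage.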
Falsified candidates (saved for posterity in docs/negative_results.md):
| Candidate | Best \|d\| | Verdict |
|---|---|---|
| Fractals (Hurst) | 0.42 | Below gate |
| MFD register-inv. | ~0.2 | Below gate |
| BENATTEN aggregate | ~0.5 | Centroid CV 68 % (below ensemble) |
| Semantic curvature | 1.44 | Failed length-confound ablation |
Passed candidates went into the active ensemble: BENADV (|d|=1.1), DivEye
(|d|=2.84 on mean_surprisal).
Weights follow `weight = 5 × max(0.05, accuracy − 0.5)`, computed from
per-detector HC3 cross-validation. The mapping pushes accurate detectors to
roughly 4–5× the weight of mediocre ones, which is necessary once you have a
strong standalone detector like DivEye, but it also creates the
small-ensemble collapse documented above.
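The mapping in one function, with illustrative (not measured) accuracies to show the spread and the 0.05 floor for near-chance detectors:

```python
def detector_weight(accuracy):
    """weight = 5 * max(0.05, accuracy - 0.5), as documented above."""
    return 5 * max(0.05, accuracy - 0.5)

# Hypothetical accuracies, purely to illustrate the ratio:
strong = detector_weight(0.95)    # ~2.25
mediocre = detector_weight(0.60)  # ~0.5  -> strong is ~4.5x mediocre
floored = detector_weight(0.40)   # 0.25 (floor kicks in below 51% accuracy)
```

Note the floor: any detector at or below chance still gets weight 0.25 rather than zero, so it can never be silently dropped from the vote.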
Inspired by the Council of High Intelligence skill, adapted to AIGT detection:
- Phase 1 — Blind-first: all detectors emit a `DETECTION_VERDICT` node without seeing each other's output (prevents anchoring bias).
- Phase 2 — Disagreement detection: the moderator detects polarity pairs and applies anti-groupthink (≥ 70 % agreement → forced steelman of the opposite position).
- Phase 3 — Cross-exam (max 2 rounds): each dissenter must respond to ≥ 1 piece of specific opposing evidence and introduce ≥ 1 new claim (novelty gate). A strong-pair rule (`PROTECTED_PAIR`) prevents the two highest-weight detectors from being flipped by majority pressure when they are in the minority.
- Phase 4 — Weighted aggregation with explicit `consensus_zones`, `irreducible_tensions`, and `minority_report` in the final ruling.
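The Phase-3 novelty gate reduces to a small predicate. A hypothetical sketch — field names and the claim representation are illustrative, not the real node schema:

```python
def passes_novelty_gate(rebuttal_claims, addressed_evidence, record):
    """A dissenter's rebuttal is admitted only if it (a) answers at least
    one specific piece of opposing evidence and (b) introduces at least
    one claim not already on the debate record."""
    answers_opposition = len(addressed_evidence) >= 1
    novel_claims = set(rebuttal_claims) - set(record)
    return answers_opposition and len(novel_claims) >= 1
```

A rebuttal that merely restates "surprisal looks flat" against an unchanged record is rejected, which is what keeps the two cross-exam rounds from looping.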
We do not claim state of the art. The table is what it is:
| System | HC3 accuracy | Adversarial | Notes |
|---|---|---|---|
| DivEye standalone (paper) | 97.7 % | n/a | Strongest single detector |
| Binoculars (2024) | ~92 % | ~80 % | Cross-perplexity ratio |
| DeTeCtive (NeurIPS 2024) | ~91 % | ~75 % | Multi-level contrastive |
| RADAR (2023) | ~88 % | ~85 % | Adversarial-trained |
| Busted 4-agent ensemble | 81.2 % | 58.3 % | This repo (32-text test set) |
| GPTZero / Originality (commercial) | 75–85 % | 60–70 % | Industry baseline |
Caveats: our validation set is 32 texts; HC3 has 24,000+. Numbers in the "Adversarial" column use register-flipped prompts (formal-AI / casual-human) and are not directly comparable across papers. Treat the table as order-of-magnitude orientation.
- Python 3.11+
- CUDA 12.4 GPU recommended (CPU works but is 10–30× slower)
- NVIDIA NIM API key (free tier sufficient for testing) — or a local Ollama server with `llama3.2`
```bash
git clone https://github.com/MorkMindy74/Busted.git
cd Busted
pip install -r requirements.txt

# Set your NIM key
cp .env.example .env
# edit .env and set NVIDIA_API_KEY

# Pull third-party deps (DeTeCtive, diveye)
# See VENDOR.md for instructions

# Build HC3 subset (only if you want to retrain classifiers)
python tests/build_hc3_subset.py
```

```bash
# Default: 4-agent winning ensemble
uvicorn backend.main:app --host 127.0.0.1 --port 8765 --reload

# Custom subset
BUSTED_DETECTORS=diveye,benadv uvicorn backend.main:app --port 8765
```

Open http://127.0.0.1:8765 in your browser. Paste text, hit Analizza, watch the debate unfold in real time.

```bash
python tests/subset_benchmark.py
# results -> tests/subset_benchmarks/summary.json
```

Allow ~3 hours for the full 5-subset sweep on an RTX 2050.
```
Busted/
├── backend/           # FastAPI app, agents, detectors, event bus, KG
│   ├── agents/        # one wrapper per detector + moderator
│   ├── detectors/     # pure detection logic (no framework)
│   ├── events/        # pub/sub EventBus
│   ├── graph/         # TemporalKnowledgeGraph + schema
│   ├── llm/           # NIMScheduler with fallback pool
│   └── routes/        # /api/analyze, /api/graph, WebSocket
├── frontend/          # vanilla HTML/JS + D3 graph viz
├── detective_models/  # our trained classifiers (joblib)
├── docs/              # research log, methodology, negative results
├── tests/             # benchmark + extraction + training scripts
├── tasks/             # internal todo + lessons (development log)
├── config.py          # detector registry, weights, model pools
├── requirements.txt
├── LICENSE            # MIT (original code)
├── NOTICE.md          # third-party attribution
└── VENDOR.md          # how to fetch third-party deps
```
- `vendor/DeTeCtive` — upstream has no `LICENSE` file at time of writing. Clone yourself; see `VENDOR.md`.
- `vendor/diveye` — CC BY-NC-SA 4.0 from IBM. Clone yourself.
- `detective_models/M4_monolingual_best.pth` (~476 MB) — third-party DeTeCtive checkpoint; download from HuggingFace.
- `hc3_data/` — Hello-SimpleAI HC3 raw dump. Build with `tests/build_hc3_subset.py`.
- API keys — `.env` is git-ignored. Use `.env.example` as a template.
Issues and PRs welcome. Ideas particularly worth exploring:
- A 5th detector that doesn't trip the Nth-agent destabilization (theory: it must contribute orthogonal evidence — Cohen's d on independent features, not just any signal).
- Evaluation on a larger held-out set (HC3 test split, RAID, M4-monolingual test).
- A "fast" mode without the LLM judge (would cut latency from ~15 s to ~3 s per text).
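One cheap screen for the "orthogonal evidence" idea above is the correlation between a candidate feature and the existing detectors' scores over the same texts: a near-zero Pearson r suggests genuinely new information, while |r| close to 1 predicts another destabilizing redundant agent. A stdlib sketch:

```python
def pearson(xs, ys):
    """Plain Pearson correlation between a candidate feature and an
    existing detector's scores; a cheap orthogonality screen."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

This is a screen, not a proof — two features can be uncorrelated and still carry no extra class signal, so the Cohen's d gate still applies afterwards.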
Please run pytest and the subset benchmark before submitting changes that
touch detector logic.
If you use this work, please cite the underlying detectors (the real science) in addition to this repo:
```bibtex
@misc{busted2026,
  title  = {Busted: A multi-agent AIGT detection framework with cross-examination debate},
  author = {Rossi, Marco},
  year   = {2026},
  url    = {https://github.com/MorkMindy74/Busted}
}
```

For the wrapped methods, cite their original papers (DivEye, DeTeCtive,
DetectGPT, etc.) — see NOTICE.md.
MIT for the original code in this repo. Third-party components
retain their respective licenses; see NOTICE.md.