A production-grade, multi-agent LLM security research pipeline. Autonomously audits smart contracts and web applications by spawning role-specialized AI subprocesses that read code, write executable proofs-of-concept, and verify findings through an adversarial second pass — with calibrated severity scoring, reproducible PoC artifacts, and hard scope enforcement.
Built around the observation that finding bugs and grading bugs require different contexts: an LLM that hypothesizes an exploit is biased toward confirming its own hypothesis, so the verification tier runs as an independent process, with a fresh context, against four pass/fail gates.
Given a bug-bounty program scope (Code4rena, Immunefi, HackerOne, Sherlock, Cantina, etc.), the harness:
- Ranks targets by expected value — post-audit code churn, prior disclosure density, payout ceiling, target-density scoring.
- Investigates each ranked target in parallel — reads the source, forms a specific hypothesis, writes an executable PoC (a Foundry test for EVM; a reproducible HTTP exchange for web2), and iterates until the proof either demonstrates impact or honestly refutes the hypothesis.
- Verifies each investigation adversarially — re-runs the PoC from a fresh context, checks assertions are impact-shaped, classifies the attacker against program exclusion rules, and confirms the bypassed feature is operationally enabled on the deployed contract.
- Reproduces verified findings three times in fresh sandboxes to confirm determinism.
- Generates submission-grade reports matched to the program's template, gated on a final operator review.
All tier-to-tier communication happens through atomic file writes to `state/`. Each role runs in a separate subprocess with its own context window, model selection, and rate-limit pool.
┌─────────────────────┐
Campaign cfg → │ Orchestrator │ (Opus 4.7 — low-volume coordinator)
└──────────┬──────────┘
│ writes state/directives/cycle_NNN.md
▼
┌─────────────────────┐
│ Priority Ranker │ (Haiku 4.5 — fast triage)
└──────────┬──────────┘
│ writes state/targets.jsonl
▼
┌───────────────┴────────────────┐
│ Investigators × N │ (Opus 4.7 — deep code reading + forge loop)
│ (parallel, disjoint context) │
└───────────────┬────────────────┘
│ writes state/investigations/cycle_NNN/I####.md
│ + state/repos/<campaign>/test/polysec/I####.t.sol
▼
┌─────────────────────┐
│ Verifier │ (Opus 4.7 — adversarial second pass)
│ Gates A/B/C1/C2     │
└──────────┬──────────┘
│ writes state/verified_findings/<profile>/F####.md
▼
┌─────────────────────┐
│ Reproducer │ (3× re-run in fresh anvil-fork or HTTP-replay)
└──────────┬──────────┘
│ writes state/repros/F####/result.json
▼
┌─────────────────────┐
│ Report Writer │ (Opus 4.7 — program-template-aware)
│ → operator gate │
└─────────────────────┘
The orchestrator coordinates cycle lifecycle, enforces per-cycle and session budgets,
dispatches tiers, and applies cross-cutting policies (refusal recovery, dud-rate
auto-stop, dedup scoring). Each downstream tier reads upstream artifacts from
state/ and writes its own atomic artifact for the next tier to consume — no
shared memory, no message bus, no hidden state.
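The temp-file + rename pattern the tiers rely on can be sketched in a few lines. This is a minimal illustration, not the harness's actual `ipc.py` API — the function name and JSON payload shape are assumptions:

```python
import json
import os
import tempfile

def atomic_write(path: str, payload: dict) -> None:
    """Write payload to path via temp-file + rename so a downstream tier
    never observes a half-written artifact, even under kill -9 mid-write."""
    dirname = os.path.dirname(path) or "."
    os.makedirs(dirname, exist_ok=True)
    # The temp file must live on the same filesystem as the destination,
    # otherwise os.replace is a copy rather than an atomic rename.
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # durable before the rename becomes visible
        os.replace(tmp, path)     # atomic on POSIX
    except BaseException:
        os.unlink(tmp)            # never leave a stray temp artifact behind
        raise
```

Readers either see the previous complete artifact or the new complete artifact; there is no in-between state to guard against.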
The investigator's primary artifact is not a hypothesis paragraph — it is an executable PoC. A Foundry test that runs and asserts a balance change, or an HTTP exchange that replays and produces impact-shaped response divergence. If the model can't write a PoC that demonstrates impact, the investigation exits cleanly as "no vulnerability found, here's what I ruled out."
The verifier then applies four hard gates. Any gate failure caps severity at L2 and sets `promoted=false`:
- Gate A — test re-runs and passes. The verifier re-executes the investigator's Foundry test (or replays the HTTP exchange) from disk in a fresh context. If it doesn't pass cold, the investigator hallucinated; reject.
- Gate B — assertions are impact-shaped. The verifier reads the test source. Disqualifying patterns: `assertTrue(success)` only, `vm.expectRevert` only, no balance / state-mutation asserts. Reports must demonstrate a dollar consequence or an unauthorized state change, not merely "the function call did not error."
- Gate C1 — attacker class is in program scope. The test's `vm.prank(attacker)` (or the HTTP request's `Authorization` header) is classified against `state/scope.md`'s exclusion rules. Privileged-role compromise that the program excludes (admin keys, internal employees, compromised customers) is N/A regardless of code correctness.
- Gate C2 — feature is operationally enabled. Via `mainnet_probe`, the verifier confirms the bypassed feature is actually live on the deployed contract. A bypass that's real in code but unreachable on the live deployment drops to L1.
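Gate B's disqualifying-pattern check can be approximated with a simple source scan. The pattern list below is illustrative only — the actual verifier is prompt-driven and reads the test semantically:

```python
import re

# Assertion shapes that demonstrate impact: a balance or state value changed.
# (Illustrative subset; the real gate covers more shapes.)
IMPACT_PATTERNS = [
    r"assertEq\(\s*\w+\.balanceOf\(",  # token balance asserted post-exploit
    r"assertGt\(",                     # value strictly increased (extraction)
    r"assertLt\(",                     # value strictly decreased (drain)
]

def is_impact_shaped(test_source: str) -> bool:
    """True if the Foundry test asserts a concrete balance/state change,
    not merely that a call succeeded or a revert was expected."""
    return any(re.search(p, test_source) for p in IMPACT_PATTERNS)
```

A test containing only `assertTrue(success)` fails this check and is capped at L2 regardless of whether it passes.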
The gates exist to enforce a discipline that solo LLM bounty-hunting workflows typically lack: real submissions must be reproducible, must demonstrate impact, must target permissionless actors, and must apply operationally.
The verifier's severity rating is anchored to feature presence, not gut feel. Anchors are concrete: "PoC test demonstrates state mutation by an unauthorized caller and the function modifies a fund-holding balance → access_control L4+." After iteration on a 300-disclosure ground-truth corpus drawn from Code4rena, Sherlock, and Immunefi public disclosures, the judge reaches ~80% exact-match severity against human-rated outcomes.
Calibration is measured continuously: every promoted finding is scored against the rubric, and a drift audit re-scores prior findings every five cycles to detect rubric drift over time.
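The exact-match metric behind the ~80% figure is deliberately strict. A sketch (parameter names are assumptions, not the eval pipeline's actual API):

```python
def exact_match_rate(judged: list[int], ground_truth: list[int]) -> float:
    """Fraction of findings whose rubric severity equals the human-rated
    severity from the disclosure corpus. Exact match only: an off-by-one
    (judge says L4, humans said L3) counts as a miss, not partial credit."""
    if len(judged) != len(ground_truth):
        raise ValueError("corpus misalignment: lists must pair one-to-one")
    return sum(j == g for j, g in zip(judged, ground_truth)) / len(judged)
```

Tracking this number per eval run (and re-computing it in the five-cycle drift audit) is what makes rubric drift visible rather than anecdotal.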
Polymorphic over four campaign profiles, each with profile-specific tool catalogs, investigator system prompts, and severity rubrics:
| Profile | Targets | Investigator focus |
|---|---|---|
| A Web2 bounty | HackerOne / Bugcrowd | OWASP class severity + impact; HTTP exchange as PoC |
| B Smart contract | Code4rena / Sherlock / Immunefi | Solidity vuln classes; Foundry test as PoC |
| C AI/ML security | Anthropic / OpenAI / HF | Prompt injection / model extraction taxonomy |
| D Generic / CTF | Bounded code tasks with tests-as-oracle | Correctness + coverage |
60+ specialized tools across nine sections — every tool follows a single digestion contract (bounded structured output, pre-filtered for anomalies, self-describing errors with next-step hints, action hints where applicable). Highlights:
- EVM offensive primitives: `foundry_test`, `slither_run`, `mythril_run`, `anvil_snapshot_replay` (fork-state diff testing), `chainlink_feed_probe`, `uniswap_amm_probe` (V2 closed-form + V3 QuoterV2 integration), `flash_loan_orchestrator` (Aave V3 / Balancer V2 / Maker DSS-Flash provider selection), `mev_bundle_sim` (Flashbots `eth_callBundle` integration), `call_graph_extractor` (Slither-backed function summaries), `role_actor_map` (regex-based permission inventory), `exploit_template_library` (parameterized templates for oracle manipulation, donation attacks, first-depositor inflation, sandwich attacks), `mainnet_probe` (cast-call wrapper with source-path-to-address resolution), `postaudit_diff` (per-file git churn extractor).
- Web2: `http_request` with baseline-diff mode, `payload_render` (templated SQLi/XSS/SSRF/SSTI/CSRF variants ranked by WAF-bypass score), `race_condition_runner`, `cors_probe`, `auth_replay`, `burp_tap` proxy mode.
- Recon: `subdomain_enum`, `js_analyzer` (endpoint extraction from public bundles), `param_discovery`, `ct_log_watch`, `wayback_diff`, `github_recon` (pre-redacted secret-leak search).
- Verification: `poc_runner` (sandbox PoC execution), `differential_test`, `cross_target_scan` (hypothesis fanned across N targets), `repro_3x`, `dupe_check` (disclosure-DB similarity search).
- AI/ML: `prompt_inject_runner`, `model_fingerprint`, `jailbreak_corpus`, `mcp_introspect`, `embedding_distance`.
- Sandbox isolation: every per-attempt execution runs in a Docker container with a default-deny network bridge, read-only rootfs, tmpfs `/work`, CPU/mem caps, `--cap-drop=ALL`, and seccomp profiles. Egress allowlist enforced at a mitmproxy bridge.
- Pre-engagement scope filter: every `campaign.targets[].surface_ref` is substring-validated against `state/scope.md`'s "In scope" block before any LLM call is made. Out-of-scope candidates fail closed with no token spend.
- Refusal-recovery protocol: distinguishes "model refuses legitimate authorized work" (retry with augmented scope context, up to 3 retries) from "model emits structured escalation" (operator resolves before any further tier dispatch).
- Atomic-write IPC: all tier artifacts written via temp-file + rename; survives `kill -9` mid-write.
- Cost ledger: per-invocation `{ts, role, model, input_tokens, output_tokens, cache_read_tokens, cache_write_tokens, usd}` ledger in `state/cost_ledger.jsonl`. Cycle and session budgets enforced by the orchestrator; reconciliation against Anthropic billing within 2%.
- Dud-rate auto-stop: if N of the last N investigations return severity=0, the orchestrator refuses to dispatch further cycles until the operator intervenes.
- Drift audit: every five cycles, the verifier re-scores five randomly sampled prior findings against the current rubric; `state/drift_audits/log.jsonl` records score deltas so rubric drift is detectable.
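The fail-closed scope filter reduces to a substring check before any token is spent. A minimal sketch — the scope-file parsing is simplified and the `## In scope` section-header format is an assumption:

```python
def in_scope(surface_ref: str, scope_md: str) -> bool:
    """Fail-closed: a target is dispatchable only if its surface_ref appears
    inside the 'In scope' block of scope.md. Anything the parser cannot
    place in that block is rejected with zero LLM calls."""
    in_block = False
    for line in scope_md.splitlines():
        if line.strip().lower().startswith("## in scope"):
            in_block = True
            continue
        if line.startswith("## "):  # any next section header ends the block
            in_block = False
        if in_block and surface_ref in line:
            return True
    return False  # default deny: unknown means out of scope
```

Because the default branch returns `False`, a malformed or missing scope file blocks dispatch rather than silently allowing it.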
Cleanly validates end-to-end on Damn Vulnerable DeFi v4: 3 of 3 cycle-16 targets (unstoppable, naive-receiver, truster) produced passing Foundry exploits with impact-shaped assertions. The naive-receiver exploit chained an 11-call multicall through a meta-tx forwarder to drain $3M of WETH in a single transaction with nonce ≤ 2 — a textbook chained DeFi exploit demonstrating multi-step reasoning across contract boundaries.
Calibration eval reaches ~80% exact-match severity rating on a 300-disclosure ground-truth corpus (Phase 1H).
- Python 3.12 via `uv`; strict mypy on `polysec/`, lenient on `scripts/`
- Pydantic v2 for every artifact schema (`polysec/schemas.py`)
- Claude Code subprocesses (`claude --print --output-format json`) per tier
- Foundry (forge / cast / anvil) for EVM PoC execution + mainnet fork probes
- Docker for sandboxed per-attempt isolation
- mitmproxy for egress allowlist enforcement + forensic traffic logging
- MCP (Model Context Protocol) for tool exposure to each Claude subprocess
- Slither + Mythril wrappers for static analysis primitives
- pytest for the unit + integration + eval test suites (1000+ tests)
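Per-tier invocation is a plain subprocess spawn. A sketch of the command construction — only the flags quoted above (`--print`, `--output-format json`) plus a `--model` selector are assumed, and the envelope fields vary by CLI version:

```python
import json
import subprocess

def tier_command(model: str) -> list[str]:
    """Command line for one role subprocess; each tier gets its own
    process, context window, and rate-limit pool."""
    return ["claude", "--print", "--output-format", "json", "--model", model]

def run_tier(model: str, prompt: str) -> dict:
    """Spawn the tier, feed the role prompt on stdin, and parse the
    JSON envelope from stdout."""
    proc = subprocess.run(tier_command(model), input=prompt,
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

Keeping the command a pure function makes it trivial to unit-test tier dispatch without spawning a real model process.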
polysec/ main package
├── orchestrator.py cycle lifecycle, tier dispatch, dud-rate gate
├── claude_invoker.py spawn `claude --print` with envelope parsing
├── ipc.py atomic file writes, frontmatter-markdown serde
├── schemas.py Pydantic models for every tier artifact
├── refusal_recovery.py 3-tier retry with scope re-anchoring
├── budgets.py per-cycle / session / tier budget enforcement
├── eval/ calibration eval pipeline (300+ disclosures)
├── reproducer/ Foundry-in-Docker + HTTP-replay reproducers
├── sandbox/ Docker spawn/exec/teardown
├── proxy/ mitmproxy controller + allowlist enforcement
├── mcp/ MCP server exposing tool catalog to Claude
├── dedup/ target-density scoring for ranker
├── bounties/ program-metadata fetchers (H1, Immunefi, C4)
└── tools/ ~60 tool catalog across 9 sections (see Capabilities)
prompts/ role system prompts
├── orchestrator.md
├── ranker.md
├── investigator_a_web2.md HTTP-exchange-first investigator
├── investigator_b_evm.md forge-test-first investigator
├── investigator_c_aiml.md
├── investigator_d_generic.md
├── verifier.md 4-gate adversarial pass, profile-aware
├── reproducer.md
├── report_writer.md
└── shared/ refusal_recovery, output_schemas, tool_catalog_*
campaigns/ per-campaign YAML config
audit/ self-audit of the tool catalog (programmatic verification)
docs/ architecture, operator_guide, threat_model
scripts/ start_cycle, eval_run, dupe_watch, …
tests/ unit + integration + eval pytest suites
- Anchored severity calibration. Most LLM-judge designs let the model pick a number from 1-5. Without anchors, the model's `4` and a human triager's `4` are different things, and that drift kills triage acceptance rates. The verifier prompt has ~80 lines of concrete anchors keyed to feature presence; calibration is measurable and tunable.
- Test-first investigator mandate. The investigator's primary artifact is a passing executable proof, not a paragraph. If the model can't write one, it returns "no vulnerability found" and the cycle moves on — eliminating the "I think this might be exploitable because…" failure mode entirely.
- Impact-shaped assertion detector. Gate B reads the test source and pattern-matches against disqualifying assertion shapes. Tests that pass without proving impact get capped at L2. This catches PoCs that look like passing tests but don't demonstrate the dollar consequence.
- Refusal recovery distinct from escalation. "Model refuses legitimate authorized work" and "model emits a genuine scope question" look similar but have different semantics. The harness handles them separately — refusals retry with augmented context; escalations block on operator resolution.
- Atomic-write IPC. Every tier artifact is written via temp-file + rename to prevent half-written reads under crash. Tested under `kill -9` mid-write.
- Cost-aware model tiering. Haiku 4.5 for high-volume triage (ranker), Opus 4.7 for load-bearing reasoning (investigator, verifier, report writer). Prompt caching anchors on stable role system prompts to halve input-token costs at scale.
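The refusal/escalation split can be sketched as a two-way dispatch. The envelope field name and the retry-then-escalate fallback here are assumptions drawn from the description above, not the harness's actual `refusal_recovery.py` API:

```python
from enum import Enum

class Disposition(Enum):
    RETRY = "retry_with_scope"      # refusal of legitimate authorized work
    ESCALATE = "block_on_operator"  # genuine scope question from the model

def classify_reply(reply: dict, attempt: int, max_retries: int = 3) -> Disposition:
    """Structured escalations block on the operator immediately; plain
    refusals are retried with the scope document re-anchored, up to
    max_retries, after which they escalate too rather than loop forever."""
    if reply.get("kind") == "escalation":
        return Disposition.ESCALATE
    if attempt >= max_retries:
        return Disposition.ESCALATE  # exhausted retries: hand to operator
    return Disposition.RETRY
```

The key design point is that the two outcomes never share a code path: a retry can never silently consume a genuine scope question.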
- Multi-agent LLM systems design: tier separation, file-based IPC, separate context windows, separate rate-limit pools, atomic state transitions.
- Calibration engineering: ground-truth corpus construction, anchored rubric design, exact-match metric tracking, drift detection.
- Security engineering: threat modeling, scope enforcement, sandbox isolation, egress allowlist, refusal-recovery protocol design, privileged-actor classification.
- Domain breadth: OWASP Top 10 (web2), Solidity vulnerability classes (EVM), AI/ML attack taxonomy (prompt injection, model extraction, jailbreak corpus), bounty-program operational details (researcher headers, KYC exclusions, primacy-of-impact rules, CVSS-tier mapping).
- Cost-aware infrastructure: per-invocation ledger, tiered budgets, prompt caching, dud-rate auto-stop.
- Production tooling: Pydantic v2 schemas, mypy strict mode, pytest, ruff, pre-commit hooks, Docker isolation, MCP integration.
```bash
git clone https://github.com/KillianM00/polysec-harness.git
cd polysec-harness
uv venv && source .venv/bin/activate
uv pip install -e .

# Run the unit tests
pytest tests/unit

# Run the calibration eval
python scripts/eval_run.py --config evals/phase1h.yaml

# Dispatch a cycle against a configured campaign
python scripts/start_cycle.py --campaign <campaign-id> --cycle <n>
```

See docs/architecture.md for internals and docs/operator_guide.md for the day-to-day workflow.
AGPL-3.0. Derivatives must remain open. Commercial licenses available on request.
Killian Miller — killianmiller6@gmail.com — github.com/KillianM00