polysec-harness

A production-grade, multi-agent LLM security research pipeline. Autonomously audits smart contracts and web applications by spawning role-specialized AI subprocesses that read code, write executable proofs-of-concept, and verify findings through an adversarial second pass — with calibrated severity scoring, reproducible PoC artifacts, and hard scope enforcement.

Built around the observation that finding bugs and grading bugs require different contexts: the LLM that hypothesizes an exploit is biased toward confirming its own hypothesis, so the verification tier runs as an independent process, with a fresh context, against four pass/fail gates.


What it does

Given a bug-bounty program scope (Code4rena, Immunefi, HackerOne, Sherlock, Cantina, etc.), the harness:

  1. Ranks targets by expected value — post-audit code churn, prior disclosure density, payout ceiling, target-density scoring.
  2. Investigates each ranked target in parallel — reads the source, forms a specific hypothesis, writes an executable PoC (a Foundry test for EVM; a reproducible HTTP exchange for web2), and iterates until the proof either demonstrates impact or honestly refutes the hypothesis.
  3. Verifies each investigation adversarially — re-runs the PoC from a fresh context, checks assertions are impact-shaped, classifies the attacker against program exclusion rules, and confirms the bypassed feature is operationally enabled on the deployed contract.
  4. Reproduces verified findings three times in fresh sandboxes to confirm determinism.
  5. Generates submission-grade reports matched to the program's template, gated on a final operator review.

All tier-to-tier communication is atomic file writes to state/. Each role runs in a separate subprocess with its own context window, model selection, and rate-limit pool.
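The atomic-write pattern behind this IPC can be sketched in a few lines. The helper name `atomic_write` is hypothetical; the harness's real implementation lives in polysec/ipc.py:

```python
import os
import tempfile
from pathlib import Path

def atomic_write(path: Path, data: str) -> None:
    """Write `data` to `path` so a reader never observes a partial file.

    The payload goes to a temp file in the same directory (same filesystem),
    then os.replace() swaps it into place with a single atomic rename.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # durable before the rename
        os.replace(tmp, path)     # atomic on POSIX; tolerates kill -9 mid-write
    except BaseException:
        os.unlink(tmp)
        raise
```

Because the temp file lives in the destination directory, the rename never crosses a filesystem boundary, which is what makes `os.replace` atomic.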

Architecture

                 ┌─────────────────────┐
   Campaign cfg → │  Orchestrator       │  (Opus 4.7 — low-volume coordinator)
                 └──────────┬──────────┘
                            │ writes state/directives/cycle_NNN.md
                            ▼
                 ┌─────────────────────┐
                 │  Priority Ranker    │  (Haiku 4.5 — fast triage)
                 └──────────┬──────────┘
                            │ writes state/targets.jsonl
                            ▼
            ┌───────────────┴────────────────┐
            │   Investigators × N            │  (Opus 4.7 — deep code reading + forge loop)
            │   (parallel, disjoint context) │
            └───────────────┬────────────────┘
                            │ writes state/investigations/cycle_NNN/I####.md
                            │     + state/repos/<campaign>/test/polysec/I####.t.sol
                            ▼
                 ┌─────────────────────┐
                 │  Verifier           │  (Opus 4.7 — adversarial second pass)
                 │  Gates A/B/C1/C2    │
                 └──────────┬──────────┘
                            │ writes state/verified_findings/<profile>/F####.md
                            ▼
                 ┌─────────────────────┐
                 │  Reproducer         │  (3× re-run in fresh anvil-fork or HTTP-replay)
                 └──────────┬──────────┘
                            │ writes state/repros/F####/result.json
                            ▼
                 ┌─────────────────────┐
                 │  Report Writer      │  (Opus 4.7 — program-template-aware)
                 │  → operator gate    │
                 └─────────────────────┘

The orchestrator coordinates cycle lifecycle, enforces per-cycle and session budgets, dispatches tiers, and applies cross-cutting policies (refusal recovery, dud-rate auto-stop, dedup scoring). Each downstream tier reads upstream artifacts from state/ and writes its own atomic artifact for the next tier to consume — no shared memory, no message bus, no hidden state.

The verification model

The investigator's primary artifact is not a hypothesis paragraph — it is an executable PoC. A Foundry test that runs and asserts a balance change, or an HTTP exchange that replays and produces impact-shaped response divergence. If the model can't write a PoC that demonstrates impact, the investigation exits cleanly as "no vulnerability found, here's what I ruled out."

The verifier then applies four hard gates. Any gate failure sets promoted=false and either rejects the finding outright or caps its severity:

  • Gate A — test re-runs and passes. Verifier re-executes the investigator's Foundry test (or replays the HTTP exchange) from disk in a fresh context. If it doesn't pass cold, the investigator hallucinated; reject.
  • Gate B — assertions are impact-shaped. Verifier reads the test source. Disqualifying patterns: assertTrue(success) only, vm.expectRevert only, no balance / state-mutation asserts. Reports must demonstrate a dollar consequence or an unauthorized state change, not merely "the function call did not error."
  • Gate C1 — attacker class is in program scope. The test's vm.prank(attacker) (or HTTP request's Authorization header) is classified against state/scope.md's exclusion rules. Privileged-role compromise that the program excludes (admin keys, internal employees, compromised customers) is N/A regardless of code correctness.
  • Gate C2 — feature is operationally enabled. Via mainnet_probe, the verifier confirms the bypassed feature is actually live on the deployed contract. A bypass that's real in code but unreachable on the live deployment drops to L1.

The gates exist to enforce a discipline that solo LLM bounty-hunting workflows typically lack: real submissions must be reproducible, must demonstrate impact, must target permissionless actors, and must apply operationally.
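Gate B's assertion-shape check can be sketched as a pattern match over the PoC source. The regex lists below are illustrative assumptions, not the harness's actual rules:

```python
import re

# Impact-shaped assertion patterns: balance deltas or explicit
# state-mutation checks, rather than bare success flags.
IMPACT_PATTERNS = [
    r"assertGt\(",              # e.g. attacker balance strictly increased
    r"assertLt\(",              # e.g. victim balance strictly decreased
    r"assertEq\(.*balance",     # balance equality against an expected delta
    r"assertEq\(.*\.owner\(",   # unauthorized ownership change
]

def gate_b(test_source: str) -> bool:
    """Pass iff the PoC contains at least one impact-shaped assertion.

    Tests whose only assertions are weak shapes -- assertTrue(success),
    vm.expectRevert -- prove that execution happened, not that impact did,
    and fail this gate.
    """
    return any(re.search(p, test_source) for p in IMPACT_PATTERNS)
```

A real detector would also need to confirm the weak shapes are not the *only* assertions present; this sketch shows just the positive half of the check.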

Calibration

The verifier's severity rating is anchored to feature presence, not feel. Anchors are concrete: "PoC test demonstrates state mutation by an unauthorized caller and the function modifies a fund-holding balance → access_control L4+." After iteration on a 300-disclosure ground-truth corpus from Code4rena, Sherlock, and Immunefi public disclosures, the judge reaches ~80% exact-match severity rating against human-rated outcomes.

Calibration is measured continuously: every promoted finding is scored against the rubric, and a drift audit re-scores prior findings every five cycles to detect rubric drift over time.
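The exact-match metric behind the ~80% figure reduces to a simple comparison over the ground-truth corpus. The helper name and integer-severity interface are assumptions:

```python
def exact_match_rate(predicted: list[int], human: list[int]) -> float:
    """Fraction of findings where the judge's severity level exactly
    equals the human triager's level on the ground-truth corpus."""
    if len(predicted) != len(human):
        raise ValueError("corpus and predictions must align one-to-one")
    hits = sum(p == h for p, h in zip(predicted, human))
    return hits / len(predicted)
```

The drift audit then amounts to re-running the judge on prior findings and logging the per-finding delta between the stored score and the fresh score.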

Capabilities

Polymorphic over four campaign profiles, each with profile-specific tool catalogs, investigator system prompts, and severity rubrics:

| Profile | Targets | Investigator focus |
|---|---|---|
| A (Web2 bounty) | HackerOne / Bugcrowd | OWASP class severity + impact; HTTP exchange as PoC |
| B (Smart contract) | Code4rena / Sherlock / Immunefi | Solidity vuln classes; Foundry test as PoC |
| C (AI/ML security) | Anthropic / OpenAI / HF | Prompt injection / model extraction taxonomy |
| D (Generic / CTF) | Bounded code tasks with tests-as-oracle | Correctness + coverage |

60+ specialized tools across nine sections — every tool follows a single digestion contract (bounded structured output, pre-filtered for anomalies, self-describing errors with next-step hints, action hints where applicable). Highlights:

  • EVM offensive primitives: foundry_test, slither_run, mythril_run, anvil_snapshot_replay (fork-state diff testing), chainlink_feed_probe, uniswap_amm_probe (V2 closed-form + V3 QuoterV2 integration), flash_loan_orchestrator (Aave V3 / Balancer V2 / Maker DSS-Flash provider selection), mev_bundle_sim (Flashbots eth_callBundle integration), call_graph_extractor (Slither-backed function summaries), role_actor_map (regex-based permission inventory), exploit_template_library (parameterized templates for oracle manipulation, donation attacks, first-depositor inflation, sandwich attacks), mainnet_probe (cast-call wrapper with source-path-to-address resolution), postaudit_diff (per-file git churn extractor).
  • Web2: http_request with baseline-diff mode, payload_render (templated SQLi/XSS/SSRF/SSTI/CSRF variants ranked by WAF-bypass score), race_condition_runner, cors_probe, auth_replay, burp_tap proxy mode.
  • Recon: subdomain_enum, js_analyzer (endpoint extraction from public bundles), param_discovery, ct_log_watch, wayback_diff, github_recon (pre-redacted secret-leak search).
  • Verification: poc_runner (sandbox PoC execution), differential_test, cross_target_scan (hypothesis fanned across N targets), repro_3x, dupe_check (disclosure-DB similarity search).
  • AI/ML: prompt_inject_runner, model_fingerprint, jailbreak_corpus, mcp_introspect, embedding_distance.

Hardening

  • Sandbox isolation: every per-attempt execution runs in a Docker container with a default-deny network bridge, read-only rootfs, tmpfs /work, CPU/mem caps, --cap-drop=ALL, and seccomp profiles. Egress allowlist enforced at a mitmproxy bridge.
  • Pre-engagement scope filter: every campaign.targets[].surface_ref is substring-validated against state/scope.md's "In scope" block before any LLM call is made. Out-of-scope candidates fail closed with no token spend.
  • Refusal-recovery protocol: distinguishes "model refuses legitimate authorized work" (retry with augmented scope context, up to 3 retries) from "model emits structured escalation" (operator-resolves before any further tier dispatch).
  • Atomic-write IPC: all tier artifacts written via temp-file + rename; survives kill -9 mid-write.
  • Cost ledger: per-invocation {ts, role, model, input_tokens, output_tokens, cache_read_tokens, cache_write_tokens, usd} ledger in state/cost_ledger.jsonl. Cycle and session budgets enforced by the orchestrator; reconciliation against Anthropic billing within 2%.
  • Dud-rate auto-stop: if the last N investigations all return severity=0, the orchestrator refuses to dispatch further cycles until the operator intervenes.
  • Drift audit: every five cycles, the verifier re-scores five randomly-sampled prior findings against the current rubric; a state/drift_audits/log.jsonl records score deltas so rubric drift is detectable.

Validation

Cleanly validates end-to-end on Damn Vulnerable DeFi v4: 3 of 3 cycle-16 targets (unstoppable, naive-receiver, truster) produced passing Foundry exploits with impact-shaped assertions. The naive-receiver exploit chained an 11-call multicall through a meta-tx forwarder to drain $3M of WETH in a single transaction with nonce ≤ 2 — a textbook chained DeFi exploit demonstrating multi-step reasoning across contract boundaries.

Calibration eval reaches ~80% exact-match severity rating on a 300-disclosure ground-truth corpus (Phase 1H).

Tech stack

  • Python 3.12 via uv; strict mypy on polysec/, lenient on scripts/
  • Pydantic v2 for every artifact schema (polysec/schemas.py)
  • Claude Code subprocesses (claude --print --output-format json) per tier
  • Foundry (forge / cast / anvil) for EVM PoC execution + mainnet fork probes
  • Docker for sandboxed per-attempt isolation
  • mitmproxy for egress allowlist enforcement + forensic traffic logging
  • MCP (Model Context Protocol) for tool exposure to each Claude subprocess
  • Slither + Mythril wrappers for static analysis primitives
  • pytest for the unit + integration + eval test suites (1,000+ tests)
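Spawning one tier subprocess and parsing its reply might look like the sketch below. The envelope field names asserted here are assumptions about the JSON shape `claude --print --output-format json` emits:

```python
import json
import subprocess

def parse_envelope(raw: str) -> dict:
    """Parse a tier subprocess's JSON envelope; fail loudly on malformed
    output so the orchestrator can react instead of propagating junk."""
    envelope = json.loads(raw)
    if not isinstance(envelope, dict):
        raise ValueError(f"expected JSON object, got {type(envelope).__name__}")
    return envelope

def invoke_tier(prompt: str, model: str, timeout_s: int = 3600) -> dict:
    """Run one role as an isolated subprocess with its own context window."""
    proc = subprocess.run(
        ["claude", "--print", "--output-format", "json", "--model", model],
        input=prompt, capture_output=True, text=True, timeout=timeout_s,
    )
    proc.check_returncode()
    return parse_envelope(proc.stdout)
```

Because each tier is a fresh process, nothing from the investigator's context can leak into the verifier's; the only shared surface is the artifact files in state/.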

Code organization

polysec/                  main package
├── orchestrator.py       cycle lifecycle, tier dispatch, dud-rate gate
├── claude_invoker.py     spawn `claude --print` with envelope parsing
├── ipc.py                atomic file writes, frontmatter-markdown serde
├── schemas.py            Pydantic models for every tier artifact
├── refusal_recovery.py   3-tier retry with scope re-anchoring
├── budgets.py            per-cycle / session / tier budget enforcement
├── eval/                 calibration eval pipeline (300+ disclosures)
├── reproducer/           Foundry-in-Docker + HTTP-replay reproducers
├── sandbox/              Docker spawn/exec/teardown
├── proxy/                mitmproxy controller + allowlist enforcement
├── mcp/                  MCP server exposing tool catalog to Claude
├── dedup/                target-density scoring for ranker
├── bounties/             program-metadata fetchers (H1, Immunefi, C4)
└── tools/                ~60 tool catalog across 9 sections (see Capabilities)

prompts/                  role system prompts
├── orchestrator.md
├── ranker.md
├── investigator_a_web2.md     HTTP-exchange-first investigator
├── investigator_b_evm.md      forge-test-first investigator
├── investigator_c_aiml.md
├── investigator_d_generic.md
├── verifier.md                4-gate adversarial pass, profile-aware
├── reproducer.md
├── report_writer.md
└── shared/                    refusal_recovery, output_schemas, tool_catalog_*

campaigns/                per-campaign YAML config
audit/                    self-audit of the tool catalog (programmatic verification)
docs/                     architecture, operator_guide, threat_model
scripts/                  start_cycle, eval_run, dupe_watch, …
tests/                    unit + integration + eval pytest suites

Notable engineering decisions

  • Anchored severity calibration. Most LLM-judge designs let the model pick a number from 1-5. Without anchors, the model's 4 and a human triager's 4 are different things, and that drift kills triage acceptance rates. The verifier prompt has ~80 lines of concrete anchors keyed to feature presence; calibration is measurable and tunable.
  • Test-first investigator mandate. The investigator's primary artifact is a passing executable proof, not a paragraph. If the model can't write one, it returns "no vulnerability found" and the cycle moves on — eliminating the "I think this might be exploitable because…" failure mode entirely.
  • Impact-shaped assertion detector. Gate B reads the test source and pattern-matches against disqualifying assertion shapes. Tests that pass without proving impact get capped at L2. This catches PoCs that look like passing tests but don't demonstrate the dollar consequence.
  • Refusal recovery distinct from escalation. "Model refuses legitimate authorized work" and "model emits a genuine scope question" look similar but have different semantics. The harness handles them separately — refusals retry with augmented context; escalations block on operator resolution.
  • Atomic-write IPC. Every tier artifact is written via temp-file + rename to prevent half-written reads under crash. Tested under kill -9 mid-write.
  • Cost-aware model tiering. Haiku 4.5 for high-volume triage (ranker), Opus 4.7 for load-bearing reasoning (investigator, verifier, report writer). Prompt caching anchors on stable role system prompts to halve input-token costs at scale.
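The refusal/escalation split can be sketched as a small routing loop. The marker string, refusal phrasing, and helper names are all illustrative assumptions:

```python
class OperatorGate(Exception):
    """Structured escalation: blocks further tier dispatch until resolved."""

def classify(reply: str) -> str:
    """Route a model reply into one of three lanes."""
    if reply.startswith("ESCALATION:"):        # assumed structured marker
        return "escalation"
    if "i can't help" in reply.lower() or "i cannot assist" in reply.lower():
        return "refusal"                       # heuristic phrasing match
    return "ok"

def recover(invoke, prompt: str, scope_context: str, max_retries: int = 3) -> str:
    """Retry refusals with scope re-anchoring; surface escalations upward."""
    for attempt in range(max_retries + 1):
        text = prompt if attempt == 0 else f"{scope_context}\n\n{prompt}"
        reply = invoke(text)
        kind = classify(reply)
        if kind == "ok":
            return reply
        if kind == "escalation":
            raise OperatorGate(reply)          # operator resolves before dispatch
    raise RuntimeError("refusal persisted after scope-augmented retries")
```

The asymmetry matters: a refusal is a recoverable model state, while an escalation is a genuine question that no amount of retrying should paper over.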

Skills demonstrated

  • Multi-agent LLM systems design: tier separation, file-based IPC, separate context windows, separate rate-limit pools, atomic state transitions.
  • Calibration engineering: ground-truth corpus construction, anchored rubric design, exact-match metric tracking, drift detection.
  • Security engineering: threat modeling, scope enforcement, sandbox isolation, egress allowlist, refusal-recovery protocol design, privileged-actor classification.
  • Domain breadth: OWASP Top 10 (web2), Solidity vulnerability classes (EVM), AI/ML attack taxonomy (prompt injection, model extraction, jailbreak corpus), bounty-program operational details (researcher headers, KYC exclusions, primacy-of-impact rules, CVSS-tier mapping).
  • Cost-aware infrastructure: per-invocation ledger, tiered budgets, prompt caching, dud-rate auto-stop.
  • Production tooling: Pydantic v2 schemas, mypy strict mode, pytest, ruff, pre-commit hooks, Docker isolation, MCP integration.

Getting started

git clone https://github.com/KillianM00/polysec-harness.git
cd polysec-harness
uv venv && source .venv/bin/activate
uv pip install -e .

# Run the unit tests
pytest tests/unit

# Run the calibration eval
python scripts/eval_run.py --config evals/phase1h.yaml

# Dispatch a cycle against a configured campaign
python scripts/start_cycle.py --campaign <campaign-id> --cycle <n>

See docs/architecture.md for internals and docs/operator_guide.md for the day-to-day workflow.

License

AGPL-3.0. Derivatives must remain open. Commercial licenses available on request.

Contact

Killian Miller — killianmiller6@gmail.com — github.com/KillianM00
