Swarm Intelligence Prediction Engine for Prediction Markets
Named after the schooling behavior of fish — individually simple, collectively intelligent.
K-Fish deploys 9 LLM agents ("Fish") that each use a structurally different reasoning framework to analyze prediction markets. Their independent probability estimates are fused through a multi-round Delphi protocol, calibrated with machine learning, and converted into risk-controlled positions using the Kelly criterion. The entire system runs through the Claude Code CLI, which is covered by the Max subscription at no additional API cost.
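The multi-round Delphi fusion described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the K-Fish implementation: the `Fish` callable signature, `run_delphi`, `make_toy_fish`, `max_rounds`, and `tolerance` are hypothetical names, and convergence is assumed to mean that the spread of estimates falls below a tolerance.

```python
from statistics import median
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: a Fish maps (question, peer estimates or None) to a probability.
Fish = Callable[[str, Optional[Sequence[float]]], float]


def run_delphi(question: str, fish: List[Fish], max_rounds: int = 3,
               tolerance: float = 0.05) -> List[float]:
    """Round 1 is independent; later rounds show every Fish the anonymized estimates."""
    estimates = [f(question, None) for f in fish]  # Round 1: no peer context
    for _ in range(max_rounds - 1):
        if max(estimates) - min(estimates) <= tolerance:
            break  # assumed convergence rule: the spread is small enough
        peers = sorted(estimates)  # sorting drops authorship (includes own estimate)
        estimates = [f(question, peers) for f in fish]  # each Fish updates or holds
    return estimates


def make_toy_fish(prior: float, stubbornness: float) -> Fish:
    """Toy stand-in for an LLM agent: moves part of the way toward the peer median."""
    def fish(question: str, peers: Optional[Sequence[float]]) -> float:
        if peers is None:
            return prior
        return stubbornness * prior + (1.0 - stubbornness) * median(peers)
    return fish


swarm = [make_toy_fish(p, 0.5) for p in (0.15, 0.20, 0.28)]
final = run_delphi("Will the Fed raise rates?", swarm)
print([round(p, 3) for p in final])
```

In a real run each toy Fish would be replaced by a call to one persona prompt; the loop structure (independent first round, peer context afterwards, early stop on convergence) is the part this sketch is meant to show.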
flowchart TD
A["`**MARKET QUESTION**
price withheld from all Fish agents`"]
B["`**SWARM ROUTER**
classifies category
selects personas, rounds, extremization`"]
NEWS["`**NEWS RETRIEVAL** ★ new
trafilatura scrapes top articles
sentence-transformers ranks by relevance
top 3 injected into Fish prompts`"]
C["`**RESEARCHER FISH**
base rates · key facts
timing · contrarian case · news context`"]
subgraph DELPHI [" MULTI-ROUND DELPHI PROTOCOL "]
direction TB
R1["`**Round 1** — Independent`"]
R2["`**Round 2** — Peer Context`"]
RN["`**Round N** — Converge`"]
R1 -- "anonymized estimates" --> R2 -- "update or hold" --> RN
end
D["`**AGGREGATION**
trimmed mean · confidence-weighted
asymmetric extremization`"]
E1["`**CALIBRATE**
netcal auto-select
Beta · Histogram · Isotonic`"]
BIAS["`**AI BIAS DETECTION** ★ new
5-layer RLHF decompressor
compression → decompress
knowledge gap → follow crowd`"]
E2["`**VOLATILITY**
GARCH regime detection
Kelly adjustment factor`"]
F["`**EDGE DETECTION**
empirically optimized threshold
confidence > 40% · spread < 35%`"]
G["`**KELLY SIZING**
quarter-Kelly · 5% max/position
30% max exposure · 15% drawdown stop`"]
H["`**POSITION**
side YES/NO · size $
expected value · reasoning chain`"]
A --> B --> NEWS --> C --> DELPHI --> D
D --> E1 & BIAS & E2
E1 & BIAS & E2 --> F
F --> G --> H
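The final sizing step in the diagram (quarter-Kelly, 5% max per position, 30% max exposure) could look something like the sketch below. The function names and argument layout are hypothetical; only the three limits come from the diagram. The Kelly fraction uses the standard form for a binary contract that costs `price` and pays 1, and fees and slippage are ignored.

```python
def kelly_fraction(p_model: float, price: float) -> float:
    """Full-Kelly bankroll fraction for buying a share at `price` (0 < price < 1).

    The share pays 1 if the event happens, so f* = (p - price) / (1 - price).
    A negative value means this side has no edge.
    """
    return (p_model - price) / (1.0 - price)


def position_size(p_model, yes_price, bankroll, open_exposure=0.0,
                  kelly_multiplier=0.25, max_position=0.05, max_exposure=0.30):
    """Return (side, dollars) under quarter-Kelly with per-position and exposure caps."""
    if p_model >= yes_price:
        side, f_star = "YES", kelly_fraction(p_model, yes_price)
    else:
        # Buying NO at (1 - yes_price) with belief (1 - p_model) is the mirror case.
        side, f_star = "NO", kelly_fraction(1.0 - p_model, 1.0 - yes_price)
    fraction = min(kelly_multiplier * f_star, max_position)  # 5% per-position cap
    room = max(max_exposure * bankroll - open_exposure, 0.0)  # 30% total exposure cap
    return side, min(fraction * bankroll, room)


print(position_size(0.55, 0.50, 1000.0))
print(position_size(0.70, 0.50, 1000.0, open_exposure=290.0))
```

The drawdown stop is deliberately left out here; it is a portfolio-level halt rather than a per-trade sizing rule.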
Market: Will the Fed increase interest rates by 25+ bps after the March 2026 meeting?
flowchart TD
Q["`**Market Question**
Will the Fed increase interest rates
by 25+ bps after the March 2026 meeting?`"]
Q --> FISH
subgraph FISH [" 9 Fish Predict Independently "]
direction LR
F1["🎯 Anchor — **0.20**"]
F2["🔀 Decomp — **0.18**"]
F3["🔍 Inside — **0.15**"]
F4["⚡ Contra — **0.28**"]
F5["⏱️ Tempo — **0.22**"]
F6["🏛️ Instit — **0.17**"]
F7["💀 Premrt — **0.25**"]
F8["📐 Calibr — **0.19**"]
F9["📊 Bayes — **0.16**"]
end
FISH --> AGG["`**Aggregation**
trimmed mean = 0.200`"]
AGG --> EXT["`**Extremization**
0.200 → 0.178`"]
EXT --> CAL["`**Calibration**
0.178 → 0.165`"]
CAL --> EDGE{"`**Edge Check**
|0.165 − 0.22| = 5.5%
threshold = 7%`"}
EDGE -->|"edge too small"| SKIP["`**NO POSITION**
edge below threshold`"]
ACTUAL["`**Actual Outcome**
Fed did not raise rates
K-Fish correct ✓ Brier = 0.027`"]
SKIP ~~~ ACTUAL
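The arithmetic in the worked example can be checked with a few lines of Python. This is a simplified sketch: `trimmed_mean`, `extremize`, and `has_edge` are hypothetical helpers, the odds-power form of extremization and its `alpha` parameter are assumptions, and the confidence weighting used by the real aggregator is not modeled. With one estimate trimmed from each end, the nine values average to about 0.196, while the plain mean is 0.200, so the diagram's figure is not reproduced exactly by this simple trim.

```python
from statistics import mean

# The nine Fish estimates from the worked example above.
estimates = [0.20, 0.18, 0.15, 0.28, 0.22, 0.17, 0.25, 0.19, 0.16]


def trimmed_mean(values, trim=1):
    """Drop the `trim` lowest and `trim` highest values, then average the rest."""
    ordered = sorted(values)
    return mean(ordered[trim:len(ordered) - trim])


def extremize(p, alpha):
    """Raise the odds to the power alpha; alpha > 1 pushes p away from 0.5."""
    odds = (p / (1.0 - p)) ** alpha
    return odds / (1.0 + odds)


def has_edge(p_model, market_price, threshold=0.07):
    """Trade only when the model and the market disagree by at least the threshold."""
    return abs(p_model - market_price) >= threshold


print(round(mean(estimates), 3), round(trimmed_mean(estimates), 3))
print(has_edge(0.165, 0.22))          # 5.5-point gap against a 7-point threshold
print(round((0.165 - 0.0) ** 2, 3))   # Brier score when the event does not occur
```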
9 Fish, each with orthogonal reasoning
Each persona encodes a structurally different decomposition strategy to maximize ensemble diversity (Schoenegger et al., Science Advances 2024).
| # | Fish | Reasoning Framework | Function |
|---|---|---|---|
| 1 | Base Rate Anchor | Reference class frequency | Anchors on historical base rates, adjusts minimally |
| 2 | Decomposer | Sub-probability multiplication | Breaks question into independent conditional sub-questions |
| 3 | Inside View | Domain-specific evidence | Finds the single most informative fact others miss |
| 4 | Contrarian | Consensus stress-testing | Constructs the strongest case for the less popular outcome |
| 5 | Temporal Analyst | Timing and momentum | Deadline analysis, hazard rates, trajectory |
| 6 | Institutional Analyst | Organizational incentives | Status quo bias, decision-maker constraints |
| 7 | Premortem | Failure scenario enumeration | Imagines why the expected outcome failed |
| 8 | Calibrator | Tetlock superforecaster protocol | Base rate → evidence → incremental update → bias check |
| 9 | Bayesian Updater | Explicit prior × likelihood | States prior, identifies evidence, applies Bayes' rule |
200 resolved Polymarket markets · 9 Fish · Claude Haiku CLI · $0 cost · 7.7 hours runtime
| Metric | K-Fish v5 (N=200) | K-Fish v4 (N=30) | Random |
|---|---|---|---|
| Brier Score | 0.206 | 0.213 | 0.250 |
| Accuracy | 69.0% | 73.3% | 50% |
| ECE | 0.140 | 0.178 | 0.250 |
| BSS vs Random | +17.6% | +14.8% | 0% |
| Cost | $0.00 | $0.00 | — |
Note
BSS (Brier Skill Score) = +17.6% means the K-Fish Brier score is 17.6% lower than the random baseline (1 − 0.206/0.250 = 0.176). The system does not yet beat the Polymarket crowd aggregate (Brier ~0.084), which incorporates information from thousands of traders, including whales and insiders. The gap is driven primarily by surprise events that occurred after the LLM training data cutoff.
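For readers who want to recompute the headline metrics, here is a minimal sketch of the three scores in the table. The function names are hypothetical, and the equal-width 10-bin ECE is an assumption, since the binning used for the reported ECE is not stated.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_skill_score(brier, brier_reference=0.25):
    """Relative improvement over a reference forecast (0.25 is a constant 0.5 forecast)."""
    return 1.0 - brier / brier_reference


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between mean forecast and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            avg_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(avg_p - freq)
    return ece


print(round(brier_skill_score(0.206), 3))  # v5 row of the table
print(round(brier_skill_score(0.213), 3))  # v4 row of the table
```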
Per-Fish Performance Rankings (N=200)
| Rank | Persona | Brier | Assessment |
|---|---|---|---|
| 1 | Contrarian | 0.199 | Best at N=200 — consensus stress-testing adds real value |
| 2 | Inside View | 0.206 | Domain expertise remains strong |
| 3 | Premortem | 0.207 | Improved with more data (was worst at N=30) |
| 4 | Calibrator | 0.209 | Tetlock method is consistently reliable |
| 5 | Decomposer | 0.211 | Conditional decomposition adds moderate value |
| 6 | Bayesian | 0.213 | Explicit prior/likelihood reasoning |
| 7 | Temporal | 0.217 | Timing analysis improved with larger sample |
| 8 | Institutional | 0.224 | Status quo analysis less useful than expected |
| 9 | Base Rate | 0.226 | Anchoring too heavily on base rates hurts on novel events |
| | ai-hedge-fund (43K stars) | PolySwarm | K-Fish |
|---|---|---|---|
| Target | Equities | Prediction markets | Prediction markets |
| Agents | 18 (investor personas) | 50 (diverse personas) | 9 (orthogonal reasoning) |
| Calibration | None | Confidence-weighted | netcal auto-select (Beta/Histogram/Isotonic) |
| Multi-round | No | No | Yes (Delphi with convergence) |
| Pre-screen | No | No | Yes (3-Fish filter for unknowable markets) |
| Cost | API calls ($) | API calls ($) | $0.00 (CLI mode) |
| Risk mgmt | Position limits | Quarter-Kelly | Quarter-Kelly + GARCH volatility + drawdown circuit breaker |
| Validated | Backtest only | Backtest only | 200-market retrodiction on resolved markets |
| Persistence | None | None | SQLite (survives restart) |
| Paper trading | No | No | Yes (full daemon loop) |
| AI bias exploit | No | No | RLHF hedging decompressor + cross-market arbitrage |
Note
Novel contribution: instead of only predicting events, K-Fish detects where AI traders are systematically wrong and exploits the bias.
flowchart TD
INPUT["`**Fish Predictions**
9 probabilities + reasoning text`"]
INPUT --> L1
subgraph DETECT [" 5-Layer Bias Detection "]
direction TB
L1["`**Layer 1 — Reasoning Coherence**
Does the text contradict the number?
Directional keywords vs stated probability`"]
L2["`**Layer 2 — Distribution Shape**
Is the swarm split (bimodal) or clustered?
Split = disagreement, not uncertainty`"]
L3["`**Layer 3 — Knowledge Cutoff**
Do Fish reference training data limits?
30%+ cutoff mentions = knowledge gap`"]
L4["`**Layer 4 — Confidence Paradox**
High confidence + neutral probability?
Confident about 0.50 = RLHF artifact`"]
L5["`**Layer 5 — Self-Calibration**
Track per-regime Brier scores
Learn which actions actually work`"]
L1 --> L2 --> L3 --> L4 --> L5
end
L5 --> R{"`**Regime
Classification**`"}
R -->|"R-P gap > 0.15
reasoning contradicts number"| D["`**RLHF Compression**
Decompress: 0.51 → 0.65
Trade on decompressed probability`"]
R -->|"30%+ Fish reference cutoff
post-training event"| C["`**Knowledge Gap**
Blend with crowd price
They have info we lack`"]
R -->|"Low R-P gap, low confidence
both AI and crowd near 0.50"| S["`**Genuine Uncertainty**
Skip — no edge exists
Save compute for better markets`"]
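The three-way regime split at the bottom of the diagram can be sketched as a small decision function. Only the 0.15 gap threshold and the 30% cutoff-mention threshold come from the diagram; the check order, the `low_confidence` and `near_half` values, the `NO_REGIME` fallthrough, and all names are assumptions made for illustration.

```python
from enum import Enum


class Regime(Enum):
    RLHF_COMPRESSION = "decompress and trade"
    KNOWLEDGE_GAP = "blend with crowd price"
    GENUINE_UNCERTAINTY = "skip"
    NO_REGIME = "use calibrated estimate as is"  # hypothetical fallthrough


def classify_regime(rp_gap, cutoff_fraction, mean_confidence, ai_prob, crowd_price,
                    gap_threshold=0.15, cutoff_threshold=0.30,
                    low_confidence=0.40, near_half=0.10):
    """Map the detector's signals to one of the regimes shown in the diagram.

    rp_gap: gap between the probability implied by the reasoning text and the
            stated probability (the "R-P gap").
    cutoff_fraction: share of Fish whose reasoning mentions the training cutoff.
    """
    if rp_gap > gap_threshold:
        return Regime.RLHF_COMPRESSION
    if cutoff_fraction >= cutoff_threshold:
        return Regime.KNOWLEDGE_GAP
    both_near_half = (abs(ai_prob - 0.5) <= near_half
                      and abs(crowd_price - 0.5) <= near_half)
    if mean_confidence < low_confidence and both_near_half:
        return Regime.GENUINE_UNCERTAINTY
    return Regime.NO_REGIME


print(classify_regime(0.33, 0.0, 0.60, 0.52, 0.72))
```

If a market triggers both the compression and the knowledge-gap conditions, this sketch lets compression win because it is checked first; the diagram does not say which branch takes priority.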
How the decompressor works
When RLHF compression is detected (Fish reasoning says "strong evidence for YES" but probability is 0.52), the decompressor estimates the pre-hedging probability:
| Signal | Value |
|---|---|
| Fish stated probability | 0.52 |
| Reasoning direction score | +0.87 (strong YES) |
| Implied probability | 0.85 |
| Confidence weight | 0.60 |
| Decompressed | 0.655 |
| Market crowd price | 0.72 |
The decompressed probability (0.655) is much closer to the crowd price (0.72) than the raw output (0.52). RLHF hedging was hiding roughly 13.5 percentage points (0.655 − 0.52) of directional signal.
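The exact decompression formula is not given here, so the following is only one plausible reading: a linear blend in which the stated probability keeps the confidence weight and the remainder goes to the probability implied by the reasoning. Under that assumption the table's inputs give 0.652, close to but not identical to the 0.655 shown, so the real decompressor evidently does something slightly different. The `decompress` name is hypothetical, and the mapping from direction score (+0.87) to implied probability (0.85) is taken as an input rather than modeled.

```python
def decompress(stated: float, implied: float, confidence_weight: float) -> float:
    """Assumed linear blend between the stated and the reasoning-implied probability."""
    return confidence_weight * stated + (1.0 - confidence_weight) * implied


value = decompress(stated=0.52, implied=0.85, confidence_weight=0.60)
print(round(value, 3))
```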
Cross-market arbitrage
Detects logically inconsistent prices across related markets:
| Type | Example | Detection |
|---|---|---|
| Subset violation | P("GPT-6 released") > P("OpenAI releases model") | Buy NO on subset, YES on superset |
| Complement violation | P(A) + P(not A) != 1.0 | Arbitrage the gap |
| Spread mispricing | Correlated markets with excessive price spread | Hedged pair trade |
Constructs hedged pair positions with full 4-scenario P&L analysis.
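As an illustration of the subset case and its 4-scenario P&L, here is a rough sketch. The function names are hypothetical, the position is one share on each leg, and fees, slippage, and resolution disputes are ignored.

```python
from itertools import product


def subset_violation(p_subset: float, p_superset: float) -> bool:
    """A subset event can never be more likely than the event that contains it."""
    return p_subset > p_superset


def complement_gap(p_yes: float, p_no: float) -> float:
    """How far the YES and NO prices of one market are from summing to 1."""
    return p_yes + p_no - 1.0


def hedged_pair_pnl(p_subset: float, p_superset: float) -> dict:
    """P&L of buying one NO share on the subset and one YES share on the superset.

    Keys are (subset_true, superset_true). The (True, False) scenario is
    logically impossible but is listed to complete the 4-scenario table.
    """
    cost = (1.0 - p_subset) + p_superset
    table = {}
    for subset_true, superset_true in product((True, False), repeat=2):
        payoff = (0.0 if subset_true else 1.0) + (1.0 if superset_true else 0.0)
        table[(subset_true, superset_true)] = payoff - cost
    return table


if subset_violation(0.40, 0.30):
    for scenario, pnl in hedged_pair_pnl(0.40, 0.30).items():
        print(scenario, round(pnl, 2))
```

Every logically possible scenario ends with a profit whenever the subset is priced above the superset, because the pair costs less than 1 and pays at least 1.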
Important
v5 transforms K-Fish from a research prototype into a production trading system with persistence, execution, monitoring, and safety controls.
graph TD
subgraph PERSIST ["💾 Persistence (Phase 1)"]
DB["SQLite database\npredictions · positions · calibration\nresolutions · system state"]
end
subgraph RETRO ["📊 Statistical Validity (Phase 2)"]
R200["200-market retrodiction\nBrier 0.206 · BSS +17.6%\nbootstrap CIs · per-category breakdown"]
end
subgraph EXEC ["⚡ Live Execution (Phase 3)"]
EX["Polymarket CLOB executor\n5 safety checks · paper default\nposition manager · reconciliation"]
end
subgraph TEST ["🧪 Test Coverage (Phase 4)"]
T120["120 tests passing\nHypothesis property-based\nunit + integration"]
end
subgraph MON ["📈 Monitoring (Phase 5)"]
DASH["Track record dashboard\nJSONL alerting\ngraceful degradation"]
end
subgraph PAPER ["📋 Paper Trading (Phase 6)"]
PT["Daemon loop (6h cycles)\ndaily/weekly reports\ngo/no-go checklist"]
end
PERSIST --> RETRO --> EXEC --> TEST --> MON --> PAPER
6 Safety Rules (non-negotiable)
| Rule | Enforcement |
|---|---|
| Paper trading is default | paper_trading=True in all constructors. --live flag + typed confirmation required. |
| Private keys never in code | Environment variables only. .env is gitignored. |
| Position limits are hard caps | Enforced at executor level: max $50/position, max $300 exposure. |
| Drawdown halt is automatic | Trading stops at -15%. Persisted to DB. Manual --reset-drawdown to resume. |
| Reconciliation runs daily | DB vs on-chain check via --reconcile flag. |
| Gradual escalation | Week 1-2: $25/pos. Week 3-4: $50/pos. Month 2+: evaluate. |
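The hard caps and the drawdown halt in the table could be enforced by a pre-trade check along these lines. This is a sketch only: `RiskLimits` and `check_order` are hypothetical names, and the real executor runs additional checks that are not listed here. The dollar and drawdown limits are the ones stated in the table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RiskLimits:
    max_position_usd: float = 50.0   # hard cap per position
    max_exposure_usd: float = 300.0  # hard cap on total open exposure
    max_drawdown: float = 0.15       # halt trading at -15% from peak equity


def check_order(size_usd: float, open_exposure_usd: float, peak_equity: float,
                current_equity: float,
                limits: Optional[RiskLimits] = None) -> Tuple[bool, str]:
    """Return (allowed, reason); the first failing rule wins."""
    limits = limits or RiskLimits()
    drawdown = (peak_equity - current_equity) / peak_equity
    if drawdown >= limits.max_drawdown:
        return False, "drawdown halt"
    if size_usd > limits.max_position_usd:
        return False, "position cap"
    if open_exposure_usd + size_usd > limits.max_exposure_usd:
        return False, "exposure cap"
    return True, "ok"


print(check_order(50.0, 0.0, 1000.0, 1000.0))
print(check_order(50.0, 280.0, 1000.0, 1000.0))
```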
K-Fish runs entirely on the Claude Code CLI (claude -p), which uses the Max subscription at no additional API cost.
| Backend | Cost | Speed | Automated | GPU |
|---|---|---|---|---|
| CLI (claude -p) | $0.00 | ~15s/Fish | Yes | No |
| Ollama (local) | $0.00 | ~5s/Fish | Yes | Yes |
| Gemini (free tier) | $0.00 | ~3s/Fish | Yes | No |
| File (manual) | $0.00 | Manual | No | No |
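A CLI-backed Fish call might be wired up roughly as below. `claude -p` is the print-mode invocation named above; everything else is an assumption for illustration, including the helper names, the timeout, and the `Probability: <number>` answer format that the parser looks for.

```python
import re
import subprocess
from typing import Optional


def ask_claude(prompt: str, timeout_s: int = 120) -> str:
    """Run the Claude Code CLI in print mode and return its stdout."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return result.stdout


def parse_probability(text: str) -> Optional[float]:
    """Extract the first 'Probability: <number>' value in [0, 1], if present.

    The answer format is an assumption; the prompt would have to request it.
    """
    match = re.search(r"probability\s*[:=]\s*([01](?:\.\d+)?)", text, re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1))
    return value if 0.0 <= value <= 1.0 else None


print(parse_probability("Base rate is low.\nProbability: 0.18"))
```

Keeping the parser separate from the subprocess call makes it easy to swap in another backend from the table without touching the parsing logic.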
# Clone and install
git clone https://github.com/ksk5429/quant.git && cd quant
pip install -e ".[dev]"
pip install netcal scoringrules quantstats trafilatura statsforecast mlflow sentence-transformers
# Scan live Polymarket markets
python -m src.markets.scanner --min-volume 100000
# Start paper trading daemon (6-hour cycles, $0 cost)
bash scripts/start_paper_trading.sh
# Daily performance check
bash scripts/daily_report.sh
# Weekly statistical review with go/no-go checklist
bash scripts/weekly_review.sh
Expected scanner output
K-FISH MARKET SCAN — 2026-04-12 22:39
Active markets scanned: 50
Rank Score Cat Price Vol($) Question
1 3.37 general 48% 10,976,062 Will Jesus Christ return before GTA VI?
2 3.32 politics 27% 22,267,744 Will Gavin Newsom win the 2028 Democratic...
3 3.32 politics 42% 11,871,967 Will J.D. Vance win the 2028 Republican...
4 3.30 geopolitics 38% 13,192,103 Iran x Israel/US conflict ends by April 7?
5 3.26 geopolitics 30% 14,068,338 Russia x Ukraine ceasefire by end of 2026?
# Run retrodiction (evaluate on 30 resolved markets)
python -m src.prediction.run_retrodiction --n 30 --model haiku --concurrent 3
# Run full live pipeline (scan → analyze → portfolio)
python -m src.mirofish.live_pipeline --top 10 --model haiku
Project Structure
src/
├── mirofish/ # Swarm Engine
│ ├── engine_v4.py # Canonical pipeline with DB integration
│ ├── llm_fish.py # 9 personas, 4 backends, asymmetric extremization
│ ├── researcher.py # Context gathering Fish
│ ├── swarm_router.py # Category routing + model competition
│ ├── news_context.py # ★ Real-time news retrieval + semantic ranking
│ ├── live_pipeline.py # Scanner → Engine → Portfolio → Report
│ └── ipc.py # File-based IPC for distributed Fish
├── prediction/ # Scoring & Calibration
│ ├── calibration.py # netcal v2: Beta/Histogram/auto-select + CRPS
│ ├── ai_bias_detector.py # ★ 5-layer RLHF hedging detector + decompressor
│ ├── advanced_scoring.py # Brier decomposition, bootstrap CI, BSS
│ ├── retrodiction_pipeline.py # ★ Expansion pipeline with parquet output
│ ├── batch_retrodiction.py # Batch evaluation with DB persistence
│ ├── volatility.py # GARCH regime detection
│ └── run_retrodiction.py # CLI-based evaluation runner
├── execution/ # v5 Live Trading
│ ├── polymarket_executor.py # py-clob-client wrapper, 5 safety checks
│ ├── position_manager.py # Execute, resolve, reconcile positions
│ ├── live_loop.py # Production daemon (6h cycles)
│ └── order_types.py # OrderResult, ClosedPosition
├── db/ # v5 Persistence
│ ├── schema.sql # 5 tables: predictions, positions, calibration, etc.
│ └── manager.py # DatabaseManager with context manager
├── reporting/ # v5 Monitoring
│ ├── dashboard.py # Markdown track record generator
│ └── alerts.py # JSONL event alerting
├── risk/ # Position Sizing
│ ├── portfolio.py # Edge detection, Kelly, drawdown monitor
│ ├── arbitrage.py # ★ Cross-market arbitrage + hedged pair trades
│ ├── threshold_optimizer.py # ★ Data-driven edge threshold from retrodiction
│ └── analytics.py # Sharpe/Sortino, Monte Carlo simulation
├── markets/ # Market Data
│ ├── polymarket.py # Gamma + CLOB API clients
│ ├── scanner.py # Live market discovery + ranking
│ ├── history.py # Resolved market scraper (2,500 markets)
│ └── dataset.py # 408K market parquet loader (DuckDB)
├── semantic/ # NLP
│ └── news_extractor.py # trafilatura + sentence-transformers
└── utils/ # Infrastructure
├── cli.py # Claude binary detection
├── experiment_tracker.py # MLflow tracking
└── config.py # YAML config loader
Module Architecture
block-beta
columns 1
block:ENGINE["🔷 ENGINE LAYER"]
columns 3
LP["live_pipeline"] E4["engine_v4"] SR["swarm_router"]
end
space
block:SWARM["🐟 SWARM LAYER"]
columns 3
LF["llm_fish\n9 personas\n4 backends"] RS["researcher\ncontext gathering"] IPC["ipc\ndistributed Fish"]
end
space
block:PRED["📊 PREDICTION LAYER"]
columns 4
CAL["calibration\nnetcal v2"] ADV["advanced_scoring\nBrier · CRPS"] VOL["volatility\nGARCH"] RET["run_retrodiction\nCLI evaluation"]
end
space
block:RISK["🛡️ RISK LAYER"]
columns 2
PF["portfolio\nKelly · edge · drawdown"] AN["analytics\nSharpe · Monte Carlo"]
end
space
block:MKT["🌐 MARKET LAYER"]
columns 4
SC["scanner\nlive discovery"] PM["polymarket\nGamma + CLOB"] HI["history\n2500 resolved"] DS["dataset\n408K parquet"]
end
ENGINE --> SWARM --> PRED --> RISK --> MKT
Key Design Decisions
Important
Every decision is grounded in peer-reviewed evidence or empirical retrodiction results.
| Decision | Why | Evidence |
|---|---|---|
| 9 orthogonal personas | Structural reasoning diversity drives accuracy | Schoenegger et al., Science Advances 2024 |
| Prices withheld from Fish | Prevents anchoring, preserves independence | PolySwarm, arXiv:2604.03888 |
| Asymmetric extremization | Suppress when Fish disagree (high spread) | Retrodiction: 5 worst markets had high spread |
| 3-Fish pre-screen | Skip unknowable markets | LLM cutoff caused Brier 0.95+ on surprises |
| Quarter-Kelly | Full Kelly has ~25% drawdowns | Kelly 1956 |
| Auto-seeded calibrator | No uncalibrated cold start | Code review: calibration was always a no-op |
| CLI over API | $0 vs $3-15/M tokens | Maximize predictions per dollar |
| Real-time news injection | Fish reason about post-cutoff events | Retrodiction: 5 worst misses were all post-cutoff surprises |
| Data-driven edge threshold | Empirically optimal from 230+ retrodictions | Threshold optimizer sweeps 0-30%, finds where Kelly returns turn positive |
| RLHF decompression | Extract true signal hidden by hedging bias | 5-layer detector: reasoning-probability gap identifies compression |
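The data-driven edge threshold row describes a sweep from 0 to 30%. A simplified version of such a sweep is sketched below. It is not the project's threshold optimizer: the function names are hypothetical, stakes are a flat $1 per trade rather than Kelly-sized, fees are ignored, and each record is assumed to be a `(p_model, market_price, outcome)` tuple with the price strictly between 0 and 1.

```python
def trade_return(p_model, price, outcome):
    """Profit per dollar staked on the side the model favors."""
    if p_model >= price:
        # Buy YES at `price`: each dollar buys 1/price shares that pay 1 on YES.
        return (1.0 - price) / price if outcome == 1 else -1.0
    # Buy NO at (1 - price): each dollar buys 1/(1 - price) shares that pay 1 on NO.
    return price / (1.0 - price) if outcome == 0 else -1.0


def sweep_thresholds(records, max_threshold=0.30, step=0.01):
    """Return [(threshold, n_trades, mean_return)] for thresholds 0..max_threshold."""
    results = []
    for i in range(int(round(max_threshold / step)) + 1):
        threshold = i * step
        returns = [trade_return(p, q, y) for p, q, y in records
                   if abs(p - q) >= threshold]
        if returns:  # skip thresholds that filter out every trade
            results.append((threshold, len(returns), sum(returns) / len(returns)))
    return results


def first_profitable_threshold(results):
    """Smallest threshold whose mean return is positive, or None if there is none."""
    for threshold, _, mean_return in results:
        if mean_return > 0:
            return threshold
    return None
```

Feeding the retrodiction records into `sweep_thresholds` and then `first_profitable_threshold` gives a rough picture of where small edges stop costing money; with few trades per threshold the estimate is noisy, so bootstrap intervals would be a sensible addition.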
Libraries
| Library | Purpose | Why This One |
|---|---|---|
| netcal | Probability calibration | 10+ methods, auto-select by sample size |
| scoringrules | CRPS, Brier, log score | JAX/Numba backends |
| quantstats | Portfolio analytics | Sharpe, Sortino, Calmar, Monte Carlo |
| trafilatura | News extraction | 0.958 F1, used by HuggingFace/IBM |
| sentence-transformers | Semantic embeddings | Market similarity, news matching |
| statsforecast | GARCH volatility | 20x faster than pmdarima |
| MLflow | Experiment tracking | Model registry for calibrators |
| mapie | Conformal prediction | Coverage-guaranteed intervals |
| Metric | Value |
|---|---|
| Python source lines | ~18,000 |
| Source modules | 40 |
| Unit tests passing | 120 |
| Code reviews completed | 5 |
| Bugs found and fixed | 42 |
| Retrodiction markets | 230+ (expanding) |
| Resolved market corpus | 5,000 |
| External dataset | 408,863 markets |
| Libraries integrated | 11 |
Tip
Full literature review with 45 references: Literature Review
| Claim | Evidence |
|---|---|
| LLM ensembles match human crowds | Schoenegger et al., Science Advances 2024 |
| Retrieval-augmented LLMs approach superforecasters | Halawi et al., NeurIPS 2024 |
| 50-persona swarm outperforms single-model on Polymarket | PolySwarm, arXiv:2604.03888 |
| RLHF models are overconfident, need calibration | Geng et al., NAACL 2024 |
| Semantic similarity outperforms price correlation | Baaijens et al., Applied Network Science 2025 |
graph TD
P1["✅ <b>Phase 1</b> — Foundation<br/>Core engine · Literature review · Polymarket API"]
P2["✅ <b>Phase 2</b> — Swarm Intelligence<br/>9 Fish personas · Multi-round Delphi · CLI execution"]
P3["✅ <b>Phase 3</b> — Calibration<br/>netcal integration · Retrodiction baseline"]
P4["✅ <b>Phase 4</b> — Risk Management<br/>Kelly sizing · Edge detection · Drawdown monitor"]
P5["✅ <b>Phase 5</b> — v5 Persistence<br/>SQLite DB · 200-market retrodiction · Brier 0.206"]
P6["✅ <b>Phase 6</b> — v5 Execution<br/>CLOB executor · Position manager · 120 tests · Dashboard"]
P7["🔄 <b>Phase 7</b> — Paper Trading<br/>2-4 weeks validation · Go/no-go checklist"]
P8["⬜ <b>Phase 8</b> — Live Trading<br/>Real capital · Gradual escalation · GRPO fine-tuning"]
P1 --> P2 --> P3 --> P4 --> P5 --> P6 --> P7 --> P8
style P1 fill:#0a2910,stroke:#3fb950,color:#3fb950
style P2 fill:#0a2910,stroke:#3fb950,color:#3fb950
style P3 fill:#0a2910,stroke:#3fb950,color:#3fb950
style P4 fill:#0a2910,stroke:#3fb950,color:#3fb950
style P5 fill:#0a2910,stroke:#3fb950,color:#3fb950
style P6 fill:#0a2910,stroke:#3fb950,color:#3fb950
style P7 fill:#1a1a2e,stroke:#58a6ff,color:#58a6ff
style P8 fill:#1a1a2e,stroke:#8b949e,color:#8b949e
Built with structured human-AI collaboration · Paper trading is the default · Live trading requires explicit human approval