How accurate are financial influencers' predictions, really?
This system automatically extracts predictions from financial blogs and newsletters (Korean + English), then verifies them against a database of 88K+ news headlines. There is no manual labeling: extraction, matching, and verdicts are all automated.
Live Dashboard · 한국어 · Experiment Log
| Metric | Value |
|---|---|
| Overall accuracy | 69.2% (731 correct / 1,056 verified) |
| vs. headline sentiment baseline | +14.9pp (baseline: 54.2%) |
| Predictions tracked | 5,180 across 3 sources |
| News headline DB | 88,713 articles (2022–2026, Korean + English) |
| Verification cost | $0.0045/prediction — 87% reduction from $0.035 |
| Monthly operating cost | ~$0.50 |
I measured a headline-sentiment baseline on the same dataset: predicting direction from the bullish/bearish word counts of keyword-matched headlines yields 54.2% accuracy (n=1,041). The blogger beats this by 14.9pp, with the gap widest on bullish calls (77.8% vs 59.9%). This suggests the predictions carry genuine signal beyond what's already priced into the news cycle.
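A minimal sketch of how such a word-count baseline could work. The word lists, the `baseline_direction` name, and the tie handling are illustrative assumptions; this README does not give the vocabulary or tie-breaking rule of the real baseline.

```python
# Hypothetical word lists, for illustration only.
BULLISH = {"surge", "rally", "gain", "rise", "jump"}
BEARISH = {"plunge", "slump", "fall", "drop", "crash"}


def baseline_direction(headlines):
    """Return 'bullish', 'bearish', or None (tie) from sentiment word counts."""
    bull = bear = 0
    for headline in headlines:
        words = headline.lower().split()
        bull += sum(w in BULLISH for w in words)
        bear += sum(w in BEARISH for w in words)
    if bull == bear:
        return None  # no signal; assumed to be excluded from scoring
    return "bullish" if bull > bear else "bearish"


print(baseline_direction(["Chip stocks rally as exports rise", "Won weakens on rate fears"]))
```

The baseline's predicted direction is then compared with the verified outcome, in the same way as the blogger's own call.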
| Source | Predictions | Verified | Accuracy |
|---|---|---|---|
| mer_ranto28 (Korean macro blog) | 5,010 | 1,052 | 69.3% |
| arthur_hayes (Crypto Trader Digest) | 150 | 4 | 50.0%* |
* Hayes verification is in progress: most of his predictions are long-term, with expected dates in 2026–2028. 66 predictions matched headlines but received PENDING verdicts due to insufficient evidence. Results will accumulate as the expected dates pass.
Data integrity note
During early batch verification, a 50-case blind audit revealed a 36% contamination rate (false CORRECT verdicts caused by insufficient matching). All 1,047 verdicts were reset and re-verified one by one with stricter matching criteria (MIN_KEYWORD_OVERLAP=3, source_url required). Current results reflect the post-reset, re-verified dataset.
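The stricter criteria might look like the filter below. This is a sketch under assumptions: `MIN_KEYWORD_OVERLAP` and `source_url` come from the note above, while the `is_valid_match` name and the dictionary shape of a headline are hypothetical.

```python
MIN_KEYWORD_OVERLAP = 3  # threshold named in the note above


def is_valid_match(prediction_keywords, headline):
    """Accept a headline as evidence only if it has a source URL and
    shares at least MIN_KEYWORD_OVERLAP keywords with the prediction.

    `headline` is assumed to be a dict with 'keywords' and 'source_url'.
    """
    if not headline.get("source_url"):
        return False  # evidence without a source cannot be audited
    overlap = set(prediction_keywords) & set(headline.get("keywords", []))
    return len(overlap) >= MIN_KEYWORD_OVERLAP
```

Rejecting weak matches pushes borderline cases toward PENDING rather than a false CORRECT.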
Blog/newsletter → Claude extraction → structured predictions (claim, keywords, expected_date)
→ keyword GIN matching against 88K headlines → vector cosine fallback on miss
→ matched headlines + prediction → Haiku verdict → CORRECT / INCORRECT / PENDING
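The keyword-first, vector-fallback step in the flow above can be sketched in memory as follows. In production this runs against PostgreSQL (GIN index plus pgvector); the function names, the record shapes, and the `min_cosine` threshold here are assumptions for illustration, not the project's actual code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_headlines(pred, headlines, min_overlap=3, min_cosine=0.8, top_k=5):
    """Try keyword overlap first; fall back to vector similarity on a miss."""
    kw = set(pred["keywords"])
    hits = [h for h in headlines if len(kw & set(h["keywords"])) >= min_overlap]
    if hits:
        return hits[:top_k], "keyword"
    # Fallback: rank all headlines by cosine similarity to the prediction.
    scored = sorted(
        ((cosine(pred["embedding"], h["embedding"]), h) for h in headlines),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [h for score, h in scored[:top_k] if score >= min_cosine], "vector"
```

Returning the match mode alongside the headlines makes it easy to audit how often the fallback path is taken.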
flowchart TD
subgraph Collection
BLOG[Naver Blog] -->|RSS| MON[source_collector]
SUB[Substack] -->|RSS| RSS[rss_collector]
MON -->|Claude extraction| DB[(PostgreSQL + pgvector)]
RSS -->|Claude extraction| DB
NEWS_RSS[Google News 33 feeds] --> NC[news_collector]
NEWS_NAVER[Naver API 14 queries] --> NC
NC --> NHL[(news_headlines 88K)]
end
subgraph Verification
DB -->|PENDING predictions| HM[headline_matcher]
NHL -->|1st keyword GIN<br>2nd vector cosine| HM
HM -->|matched headlines| HAIKU[Haiku Batch API]
HAIKU -->|verdict + reason| DB
end
DB --> DASH[Streamlit Dashboard]
Daily pipeline (GitHub Actions, KST 01:00):
- Collect new posts from all active sources + extract insights via Claude
- Collect news headlines (33 RSS feeds + Naver API)
- Auto-verify (headline matching → Haiku verdict)
Financial blog posts contain no tickers, dates, or confidence levels. Figuring out what counts as a prediction, when it should be verified, and what criteria determine correct vs. incorrect — that structuring problem alone isn't solved by any off-the-shelf tool.
Automated verification is even harder. Using API web_search costs $0.035/prediction (5,000 = $175). After testing 6 different approaches, I settled on news headline DB + keyword/vector hybrid matching + Batch API.
6 approaches compared → 87% cost reduction
| Approach | Match rate | Cost/pred | Notes |
|---|---|---|---|
| API only (no search) | 16.9% | $0.01 | 80% PENDING — knowledge cutoff |
| API + Brave one-shot | 37.7% | $0.02 | Insufficient snippets |
| API + web_search (Sonnet) | 30% | $0.05 | Unstable |
| API + agentic tool_use (Opus) | 40% | $0.26 | Token cost explosion |
| API + web_search 1-by-1 (Haiku) | 80% | $0.035 | 5,000 preds = $175 |
| News DB + Batch API ★ | prod | $0.0045 | 5,000 preds = ~$3 |
Details: Experiment Log
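The headline cost figures follow from the per-prediction prices in the table. A quick check, using only numbers stated above (the helper names are illustrative):

```python
WEB_SEARCH_COST = 0.035  # $/prediction, web_search 1-by-1 (Haiku)
BATCH_COST = 0.0045      # $/prediction, News DB + Batch API


def total_cost(n_predictions, cost_per_prediction):
    """Total verification cost in dollars."""
    return n_predictions * cost_per_prediction


def reduction(old, new):
    """Fractional cost reduction when moving from `old` to `new`."""
    return 1 - new / old


print(f"${total_cost(5000, WEB_SEARCH_COST):.2f}")       # $175.00
print(f"{reduction(WEB_SEARCH_COST, BATCH_COST):.1%}")   # 87.1%
```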
| Status | Count |
|---|---|
| CORRECT | 731 |
| INCORRECT | 325 |
| Verifiable PENDING | 225 |
| Unverifiable (vague/conditional) | 2,793 |
| Future (awaiting expected_date) | 1,203 |
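Accuracy is computed over predictions that received a definite verdict only; pending, unverifiable, and future predictions are excluded from the denominator. A small sketch using the counts above (the `accuracy` helper is illustrative):

```python
status_counts = {"CORRECT": 731, "INCORRECT": 325}


def accuracy(counts):
    """Share of CORRECT among predictions with a definite verdict."""
    verified = counts["CORRECT"] + counts["INCORRECT"]
    return counts["CORRECT"] / verified


print(f"{accuracy(status_counts):.1%}")  # 69.2%
```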
Hybrid BM25 + pgvector search retrieves relevant context from 24,385 indexed insights.
| α (BM25 weight) | Precision@5 | Recall@5 | MRR |
|---|---|---|---|
| 0.0 (vector) | 0.199 | 0.995 | 0.995 |
| 0.6 (prod) | 0.200 | 1.000 | 0.968 |
| 1.0 (BM25) | 0.196 | 0.980 | 0.935 |
α=0.6 is the only setting in this table that achieves perfect Recall@5 (1.000), at a small cost in MRR relative to pure vector search (0.968 vs 0.995).
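One way an α weight can be combined with rank fusion is a weighted variant of reciprocal rank fusion (RRF), sketched below. This is an assumption about how the two could fit together, not the project's confirmed formula; the `fuse` name and the constant `k=60` (a common RRF default) are illustrative.

```python
def fuse(bm25_ranked, vector_ranked, alpha=0.6, k=60):
    """Weighted reciprocal rank fusion of two ranked lists of document ids.

    alpha weights the BM25 list and (1 - alpha) the vector list, so
    alpha=1.0 reduces to BM25 order and alpha=0.0 to vector order.
    """
    scores = {}
    for rank, doc in enumerate(bm25_ranked, start=1):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank)
    for rank, doc in enumerate(vector_ranked, start=1):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


print(fuse(["a", "b", "c"], ["c", "a", "d"]))  # ['a', 'c', 'b', 'd']
```

Documents ranked well by both retrievers rise to the top, while a document found by only one retriever still stays in the candidate list.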
| Layer | Technology |
|---|---|
| LLM | Claude Sonnet 4.6 (extraction) / Haiku 4.5 (verification, Batch API) |
| Embeddings | intfloat/multilingual-e5-large (1024-dim, local) |
| DB | PostgreSQL 16 + pgvector (HNSW) |
| Search | BM25 (kiwipiepy) + pgvector → RRF fusion |
| News | Google News RSS + Naver API + feedparser |
| NLP | kiwipiepy (Korean morphemes) + compound term extraction (English) |
| Scheduler | GitHub Actions cron |
| Dashboard | Streamlit Cloud |
git clone https://github.com/11e3/insight-verify.git
cd insight-verify
cp .env.example .env # set API keys
docker compose up -d db
python scripts/run_batch.py all # extract insights
python -m scripts.run_job # run pipeline once
streamlit run src/dashboard/app.py # dashboard
src/
├── collect/ # Data collection (blog, Substack RSS, news)
├── extract/ # Claude insight extraction (Batch + realtime)
├── verify/ # Auto-verification (headline_matcher + auto_verifier)
├── search/ # Hybrid search (BM25 + pgvector)
├── embed/ # Embeddings (multilingual-e5-large)
├── eval/ # Search quality evaluation (ablation, LLM judge)
├── pipeline/ # Daily pipeline orchestrator
├── dashboard/ # Streamlit dashboard
└── config/ # Settings, prompts (Korean + English)
scripts/ops/ # Batch verification, news backfill, Substack backfill, data ops
| Component | Cost |
|---|---|
| Insight extraction | ~$0.01/post |
| Auto-verification | ~$0.0045/pred |
| News collection | $0 |
| Embeddings | $0 (local) |
| Monthly operations | ~$0.50 |
MIT