insight-verify

How accurate are financial influencers' predictions, really?

This system automatically extracts predictions from financial blogs and newsletters (Korean + English), then verifies them against a database of 88K+ news headlines. There is no manual labeling: extraction, matching, and verdicts are all automated.

Live Dashboard · Korean (한국어) · Experiment Log

Key Findings

Metric	Value
Overall accuracy	69.2% (731 correct / 1,056 verified)
vs. headline sentiment baseline	+14.9pp (baseline: 54.2%)
Predictions tracked	5,180 across 3 sources
News headline DB	88,713 articles (2022–2026, Korean + English)
Verification cost	$0.0045/prediction — 87% reduction from $0.035
Monthly operating cost	~$0.50

I measured a headline-sentiment baseline on the same dataset: predicting direction from the bullish/bearish word counts of keyword-matched headlines yields 54.2% accuracy (n=1,041). The extracted predictions beat this baseline by 14.9pp, with the gap widest on bullish calls (77.8% vs 59.9%). This suggests the predictions carry genuine signal beyond what is already reflected in the news cycle.
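A minimal sketch of how a word-count baseline like the one described above could work. The word lists, the tie handling, and the function names are assumptions for illustration; the project's actual lexicon and scoring rules are not shown in this README.

```python
import re

# Hypothetical English lexicons; the real baseline also covers Korean headlines.
BULLISH = {"rally", "surge", "gain", "rebound", "soar"}
BEARISH = {"plunge", "slump", "drop", "selloff", "crash"}


def baseline_direction(headlines):
    """Predict 'bullish' or 'bearish' from word counts; None on a tie."""
    bull = bear = 0
    for headline in headlines:
        tokens = re.findall(r"[a-z]+", headline.lower())
        bull += sum(token in BULLISH for token in tokens)
        bear += sum(token in BEARISH for token in tokens)
    if bull == bear:
        return None
    return "bullish" if bull > bear else "bearish"


def baseline_accuracy(pairs):
    """pairs: (predicted, actual) tuples; ties (predicted is None) are skipped."""
    scored = [(p, a) for p, a in pairs if p is not None]
    if not scored:
        return 0.0
    return sum(p == a for p, a in scored) / len(scored)
```

Skipping ties is one possible choice; counting them as wrong would lower the baseline, so the treatment of ties should be stated alongside any reported number.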

Source Leaderboard

Source	Predictions	Verified	Accuracy
mer_ranto28 (Korean macro blog)	5,010	1,052	69.3%
arthur_hayes (Crypto Trader Digest)	150	4	50.0%*

* Hayes verification in progress — most predictions are long-term (2026–2028 expected dates). 66 predictions matched headlines but received PENDING verdicts due to insufficient evidence. Results will accumulate as expected dates pass.

Data integrity note

During early batch verification, a 50-case blind audit revealed a 36% contamination rate (false CORRECT verdicts caused by insufficient matching). All 1,047 verdicts were reset and re-verified one by one with stricter matching criteria (MIN_KEYWORD_OVERLAP=3, source_url required). Current results reflect the post-reset verified dataset.
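The stricter criteria can be pictured as a gate applied to every candidate headline before it reaches the verdict step. This is a sketch only: `MIN_KEYWORD_OVERLAP` and `source_url` come from the note above, while the `Headline` class and `is_acceptable_match` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional

MIN_KEYWORD_OVERLAP = 3  # threshold named in the data integrity note


@dataclass(frozen=True)
class Headline:
    """Hypothetical record shape; the real schema is not shown here."""
    title: str
    keywords: FrozenSet[str]
    source_url: Optional[str]


def is_acceptable_match(prediction_keywords: Iterable[str], headline: Headline) -> bool:
    """Reject headlines without a source URL or with too few shared keywords."""
    if not headline.source_url:
        return False
    overlap = set(prediction_keywords) & headline.keywords
    return len(overlap) >= MIN_KEYWORD_OVERLAP
```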

How it Works

Blog/newsletter → Claude extraction → structured predictions (claim, keywords, expected_date)
    → keyword GIN matching against 88K headlines → vector cosine fallback on miss
    → matched headlines + prediction → Haiku verdict → CORRECT / INCORRECT / PENDING
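The "keyword first, vector on miss" control flow above can be sketched in memory as follows. In the real system the keyword stage presumably runs against a PostgreSQL GIN index and the fallback against pgvector; this sketch mirrors only the ordering of the two stages. The dict layout, `min_cosine=0.8`, and `top_k=5` are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_headlines(pred, headlines, min_overlap=3, min_cosine=0.8, top_k=5):
    """Return (stage, matches). pred and each headline are dicts with
    'keywords' (a set of str) and 'embedding' (a list of floats)."""
    # Stage 1: keyword overlap, standing in for the GIN-indexed lookup.
    keyword_hits = [
        h for h in headlines
        if len(pred["keywords"] & h["keywords"]) >= min_overlap
    ]
    if keyword_hits:
        return "keyword", keyword_hits[:top_k]

    # Stage 2: cosine-similarity fallback, used only when stage 1 finds nothing.
    scored = [(cosine(pred["embedding"], h["embedding"]), h) for h in headlines]
    scored = [(score, h) for score, h in scored if score >= min_cosine]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return "vector", [h for _, h in scored[:top_k]]
```

An empty result from both stages would leave the prediction without evidence, which matches the PENDING outcome in the flow above.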

flowchart TD
    subgraph Collection
        BLOG[Naver Blog] -->|RSS| MON[source_collector]
        SUB[Substack] -->|RSS| RSS[rss_collector]
        MON -->|Claude extraction| DB[(PostgreSQL + pgvector)]
        RSS -->|Claude extraction| DB
        NEWS_RSS[Google News 33 feeds] --> NC[news_collector]
        NEWS_NAVER[Naver API 14 queries] --> NC
        NC --> NHL[(news_headlines 88K)]
    end

    subgraph Verification
        DB -->|PENDING predictions| HM[headline_matcher]
        NHL -->|1st keyword GIN<br>2nd vector cosine| HM
        HM -->|matched headlines| HAIKU[Haiku Batch API]
        HAIKU -->|verdict + reason| DB
    end

    DB --> DASH[Streamlit Dashboard]

Daily pipeline (GitHub Actions, KST 01:00):

1. Collect new posts from all active sources + extract insights via Claude
2. Collect news headlines (33 RSS feeds + Naver API)
3. Auto-verify (headline matching → Haiku verdict)
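GitHub Actions evaluates `schedule` cron expressions in UTC, so a 01:00 KST run has to be written as a UTC time. The helper below is a small sketch of that conversion; `kst_to_utc_cron` is a hypothetical name, and the repository's actual workflow file is not shown here.

```python
from datetime import datetime, timedelta, timezone

# Korea Standard Time has no daylight saving time, so a fixed +09:00 offset is enough.
KST = timezone(timedelta(hours=9), "KST")


def kst_to_utc_cron(hour, minute=0):
    """Build a daily cron expression in UTC for a given KST wall-clock time."""
    local = datetime(2024, 1, 1, hour, minute, tzinfo=KST)  # the date is arbitrary
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


print(kst_to_utc_cron(1))  # 0 16 * * *
```

Because the job is daily, the shift to the previous UTC calendar day does not affect the expression; it would matter for a schedule restricted to specific weekdays.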

Why is this Hard?

Financial blog posts contain no tickers, dates, or confidence levels. Figuring out what counts as a prediction, when it should be verified, and what criteria determine correct vs. incorrect — that structuring problem alone isn't solved by any off-the-shelf tool.

Automated verification is even harder. Using API web_search costs $0.035/prediction (5,000 predictions = $175). After testing 6 different approaches, I settled on news headline DB + keyword/vector hybrid matching + Batch API.

6 approaches compared → 87% cost reduction

Approach	Match rate	Cost/pred	Notes
API only (no search)	16.9%	$0.01	80% PENDING — knowledge cutoff
API + Brave one-shot	37.7%	$0.02	Insufficient snippets
API + web_search (Sonnet)	30%	$0.05	Unstable
API + agentic tool_use (Opus)	40%	$0.26	Token cost explosion
API + web_search 1-by-1 (Haiku)	80%	$0.035	5,000 preds = $175
News DB + Batch API ★	prod	$0.0045	5,000 preds = ~$3

Details: Experiment Log

Current Verification Status

Status	Count
CORRECT	731
INCORRECT	325
Verifiable PENDING	225
Unverifiable (vague/conditional)	2,793
Future (awaiting expected_date)	1,203
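The headline accuracy figure follows directly from the first two rows of this table, since only CORRECT and INCORRECT verdicts count as verified. A short sketch of that arithmetic (the dict keys and function name are illustrative):

```python
STATUS_COUNTS = {
    "CORRECT": 731,
    "INCORRECT": 325,
    "PENDING_VERIFIABLE": 225,
    "UNVERIFIABLE": 2793,
    "FUTURE": 1203,
}


def verdict_accuracy(counts):
    """Accuracy over decided verdicts only (CORRECT + INCORRECT)."""
    decided = counts["CORRECT"] + counts["INCORRECT"]
    return counts["CORRECT"] / decided if decided else 0.0


print(STATUS_COUNTS["CORRECT"] + STATUS_COUNTS["INCORRECT"])  # 1056
print(f"{verdict_accuracy(STATUS_COUNTS):.1%}")               # 69.2%
```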

Search Infrastructure

Hybrid BM25 + pgvector search retrieves relevant context from 24,385 indexed insights.

α (BM25 weight)	Precision@5	Recall@5	MRR
0.0 (vector)	0.199	0.995	0.995
0.6 (prod)	0.200	1.000	0.968
1.0 (BM25)	0.196	0.980	0.935

Of the three settings, α=0.6 is the only one that reaches perfect Recall@5 (1.000), at the cost of a lower MRR than pure vector search (0.968 vs 0.995).
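For reference, the three metrics in the table can be computed per query as sketched below and then averaged over the evaluation set. These are the standard definitions; the evaluation data and the exact code under `src/eval/` are not shown in this README, so the function names are illustrative.

```python
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked, relevant, k=5):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result; 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Average per-query scores, e.g. reciprocal ranks to get MRR."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

A Precision@5 near 0.200 alongside a Recall@5 near 1.000 would be consistent with roughly one relevant item per query, although the evaluation setup is not stated here.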

Tech Stack

Layer	Technology
LLM	Claude Sonnet 4.6 (extraction) / Haiku 4.5 (verification, Batch API)
Embeddings	`intfloat/multilingual-e5-large` (1024-dim, local)
DB	PostgreSQL 16 + pgvector (HNSW)
Search	BM25 (kiwipiepy) + pgvector → RRF fusion
News	Google News RSS + Naver API + feedparser
NLP	kiwipiepy (Korean morphemes) + compound term extraction (English)
Scheduler	GitHub Actions cron
Dashboard	Streamlit Cloud
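The Search row lists RRF fusion, while the ablation table varies a BM25 weight α. One way the two could fit together is a weighted Reciprocal Rank Fusion, sketched below. This is an assumption, not the project's confirmed formula: the weighting scheme, the function name, and `k=60` (a commonly used RRF constant) are all illustrative.

```python
def weighted_rrf(bm25_ranked, vector_ranked, alpha=0.6, k=60):
    """Fuse two ranked lists of document ids.

    Each list contributes weight / (k + rank) per document, where the BM25
    list is weighted by alpha and the vector list by (1 - alpha).
    Returns document ids sorted by fused score, best first.
    """
    scores = {}
    for weight, ranking in ((alpha, bm25_ranked), (1.0 - alpha, vector_ranked)):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

Under this formulation α=1.0 reproduces the BM25 order and α=0.0 reproduces the vector order, which lines up with how the ablation rows are labeled.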

Quick Start

git clone https://github.com/11e3/insight-verify.git
cd insight-verify
cp .env.example .env  # set API keys

docker compose up -d db
python scripts/run_batch.py all        # extract insights
python -m scripts.run_job              # run pipeline once
streamlit run src/dashboard/app.py     # dashboard

Project Structure

src/
├── collect/     # Data collection (blog, Substack RSS, news)
├── extract/     # Claude insight extraction (Batch + realtime)
├── verify/      # Auto-verification (headline_matcher + auto_verifier)
├── search/      # Hybrid search (BM25 + pgvector)
├── embed/       # Embeddings (multilingual-e5-large)
├── eval/        # Search quality evaluation (ablation, LLM judge)
├── pipeline/    # Daily pipeline orchestrator
├── dashboard/   # Streamlit dashboard
└── config/      # Settings, prompts (Korean + English)

scripts/ops/     # Batch verification, news backfill, Substack backfill, data ops

Cost

Component	Cost
Insight extraction	~$0.01/post
Auto-verification	~$0.0045/pred
News collection	$0
Embeddings	$0 (local)
Monthly operations	~$0.50

License

MIT
