A from-scratch Python implementation of "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., ICLR 2023).
Standard CoT prompting decodes a single reasoning chain greedily (temperature 0).
Self-Consistency samples N chains at temperature 0.7, extracts the numeric answer from each, then majority-votes across them.
The intuition: correct reasoning paths tend to converge on the same answer, while wrong paths tend to scatter across different answers and so rarely outvote it.
```
question ──► [prompt + few-shot CoT]
                  │
                  ├──► sample 1 ──► parse ──► 8
                  ├──► sample 2 ──► parse ──► 8
                  ├──► sample 3 ──► parse ──► 9   ← outlier
                  ├──► ...
                  └──► sample N ──► parse ──► 8

majority_vote([8, 8, 9, 8, ...]) ──► answer = 8 ✓
```
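The voting step in the diagram can be sketched in a few lines. This is a minimal illustration, not the repository's `majority_vote()`: the return shape, the handling of unparsed chains (`None`), and the first-seen tie-break are all assumptions.

```python
from collections import Counter
from fractions import Fraction


def majority_vote(answers):
    """Return (winner, votes, distribution) for a list of parsed answers.

    Assumptions for this sketch: None marks a chain whose answer could not
    be parsed and is ignored; ties go to the answer seen first, because
    Counter.most_common keeps insertion order among equal counts.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0, []
    distribution = counts.most_common()
    winner, votes = distribution[0]
    return winner, votes, distribution


winner, votes, dist = majority_vote(
    [Fraction(8), Fraction(8), Fraction(9), None, Fraction(8)]
)
print(winner, votes)  # 8 3
```

Voting over exact `Fraction` values rather than raw strings means `8`, `8.0`, and `16/2` all count toward the same answer.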
Wang et al. report a +17.9 pp accuracy gain on GSM8K (56.5 % → 74.4 %) over standard CoT with PaLM 540B, and consistent gains across 9 benchmarks without any training or extra supervision.
- Zero external dependencies: pure Python 3.11 stdlib
- OpenRouter back-end: drop in any model via `OPENROUTER_API_KEY`
- Parallel sampling: N calls via `ThreadPoolExecutor` (respects rate limits)
- Robust answer parsing: handles `$8`, `8,000`, `3/4`, `50%`, decimal, signed
- GSM8K benchmark mode: compare SC vs CoT vs direct I/O side-by-side
- 29 offline unit tests: no LLM calls needed for the test suite
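To make the answer-parsing bullet concrete, here is one way a `Fraction`-based parser for those formats could look. It is a sketch under assumptions, not the code in `gsm8k.py`: the name `parse_answer`, the rule of taking the last number in the text, and reading `50%` as `1/2` are illustrative choices.

```python
import re
from fractions import Fraction

# Optional sign, optional "$", digits with optional thousands commas,
# then an optional decimal part or "/denominator", then an optional "%".
_NUMBER = re.compile(r"([-+]?)\$?(\d[\d,]*(?:\.\d+)?(?:/\d+)?)(%?)")


def parse_answer(text):
    """Return the last number found in `text` as a Fraction, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    sign, body, percent = matches[-1]
    body = body.replace(",", "")  # "8,000" -> "8000"
    try:
        value = Fraction(body)  # accepts "8", "2.5", and "3/4"
    except (ValueError, ZeroDivisionError):
        return None
    if sign == "-":
        value = -value
    if percent:
        value /= 100  # assumption: "50%" means 1/2
    return value


print(parse_answer("16 - 3 - 4 = 9, so she earns $18 a day"))  # 18
```

Taking the last number is a heuristic that suits chains ending in "The answer is X"; a chain that keeps talking after its answer would need a stricter pattern.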
```bash
git clone https://github.com/MONISMALIK1/self_consistency
cd self_consistency
pip install -e .
export OPENROUTER_API_KEY="sk-or-..."   # https://openrouter.ai/keys
```

```bash
# Solve a single problem (N=8 samples, majority vote)
# Note: \$ keeps the shell from expanding $2 inside the double quotes
python -m self_consistency "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast and bakes 4 into muffins. She sells the remainder for \$2 each. How much does she make daily?"

# Plain CoT baseline (N=1)
python -m self_consistency "..." --n 1

# Direct I/O baseline (no reasoning)
python -m self_consistency "..." --io

# Download the GSM8K test set (no API calls)
python -m self_consistency --download

# Benchmark: SC vs CoT on 25 problems
python -m self_consistency --bench --num 25 --methods sc,cot

# All three methods on 10 problems
python -m self_consistency --bench --num 10 --methods sc,cot,io
```

```python
from self_consistency import solve, is_correct, load_test

sol = solve("Janet's ducks lay 16 eggs...", n=8)
print(sol.answer)        # Fraction(18)
print(sol.votes)         # 6 (6 of 8 chains agreed)
print(sol.distribution)  # [('18', 6), ('16', 1), ('20', 1)]

# Benchmark
for p in load_test(n=20):
    sol = solve(p["question"], n=8)
    print(is_correct(sol.answer, p["gold"]))
```

```
self_consistency/
├── llm.py          OpenRouter HTTP wrapper + parallel n-sampling
├── prompts.py      8-shot GSM8K CoT prompt (Wei et al. 2022)
├── gsm8k.py        Dataset loader + Fraction-based answer parser
├── core.py         majority_vote() + solve() → Solution
├── __init__.py     Public API
├── __main__.py     CLI
└── tests/
    ├── test_core.py    Voting logic + mocked solve()
    └── test_gsm8k.py   Parser edge cases + dataset loader
```
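The "mocked solve()" tests suggest the pipeline can run without a network. One way that could work is to pass the sampler in as a function, so a test can substitute canned answers. The sketch below is hypothetical: `solve_with`, `fake_sampler`, and this `Solution` definition are illustrative names whose fields mirror the usage example, not the contents of `core.py`.

```python
from collections import Counter
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional


@dataclass
class Solution:
    # Field names mirror the usage example; the real class may differ.
    answer: Optional[Fraction]
    votes: int
    distribution: list


def solve_with(sampler, question, n=8):
    """Hypothetical helper: sampler(question, n) returns parsed answers
    (Fraction, or None for a chain that could not be parsed)."""
    parsed = [a for a in sampler(question, n) if a is not None]
    ranked = Counter(parsed).most_common()
    if not ranked:
        return Solution(None, 0, [])
    answer, votes = ranked[0]
    return Solution(answer, votes, [(str(a), c) for a, c in ranked])


def fake_sampler(question, n):
    """Stand-in for the LLM: canned answers, no network access."""
    canned = [18, 18, 16, 18, 20, 18, 18, 18]
    return [Fraction(x) for x in canned[:n]]


sol = solve_with(fake_sampler, "Janet's ducks lay 16 eggs...", n=8)
print(sol.answer, sol.votes)  # 18 6
```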
| Env var | Default | Purpose |
|---|---|---|
| `OPENROUTER_API_KEY` | (required) | OpenRouter API key |
| `SC_MODEL` | `openai/gpt-oss-120b:free` | Model slug |
| `SC_CONCURRENCY` | `4` | Parallel threads for `sample_n` |
| `SC_DATA_DIR` | `~/.cache/self_consistency` | GSM8K cache location |
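As a rough illustration of how these settings might be consumed, the sketch below reads the four variables with the defaults from the table and caps concurrent calls with a `ThreadPoolExecutor`. The function `load_config` and the `sample_n` signature shown here are assumptions; only the variable names and defaults come from the table.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def load_config(env=os.environ):
    """Hypothetical helper: read settings, falling back to the table's defaults."""
    return {
        "api_key": env.get("OPENROUTER_API_KEY"),  # required for real calls
        "model": env.get("SC_MODEL", "openai/gpt-oss-120b:free"),
        "concurrency": int(env.get("SC_CONCURRENCY", "4")),
        "data_dir": os.path.expanduser(
            env.get("SC_DATA_DIR", "~/.cache/self_consistency")
        ),
    }


def sample_n(call_model, prompt, n, concurrency):
    """Call call_model(prompt) n times, at most `concurrency` at once.

    Assumed signature; call_model stands in for one chat-completion request.
    Results come back in submission order.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(call_model, prompt) for _ in range(n)]
        return [f.result() for f in futures]


cfg = load_config({})  # empty mapping, so every default applies
chains = sample_n(lambda p: p.upper(), "hi", n=5, concurrency=cfg["concurrency"])
print(len(chains))  # 5
```

Bounding `max_workers` limits the number of requests in flight, which is a simple way to stay under a provider's rate limit without adding retry logic.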
No API key required:
```bash
python -m unittest discover -s self_consistency/tests -t . -v
```

```bibtex
@inproceedings{wang2023selfconsistency,
  title     = {Self-Consistency Improves Chain of Thought Reasoning in Language Models},
  author    = {Xuezhi Wang and Jason Wei and Dale Schuurmans and Quoc Le and
               Ed Chi and Sharan Narang and Aakanksha Chowdhery and Denny Zhou},
  booktitle = {ICLR},
  year      = {2023},
  url       = {https://arxiv.org/abs/2203.11171}
}
```