Institutional-grade financial reasoning engine built on Qwen 2.5 14B Instruct. Fine-tuned on 17,000 gold-standard SEC filing analyses across 580+ companies over 5 years. Specialized for evidence-grounded causal inference, uncertainty calibration, and structured analytical reasoning.
| Version | Link |
|---|---|
| Phase 1 — SFT | fcyber/FinReasoner-qwen2.5-14b-instruct-phase1 |
| Phase 3 — DPO | fcyber/FinReasoner-qwen2.5-14b-instruct-phase3 |
| Final Production | fcyber/FinReasoner-qwen2.5-14b-instruct |
FinReasoner is NOT a general financial chatbot.
It is a financial analytical reasoning engine designed for:
- Evidence-grounded causal analysis of SEC filings (10-K, 10-Q, 10-Q/A)
- Structured multi-metric reasoning with explicit calculation transparency
- Calibrated uncertainty expression when evidence is absent or conflicting
- Longitudinal trend analysis across fiscal periods
- Market reaction integration with abnormal return context
- Segment-level performance decomposition
Reliability > Creativity
Grounding > Eloquence
Calibration > Verbosity
Evidence > Inference
Financial analysis lives across three silos that never talk to each other — numbers in spreadsheets, narratives in PDF filings, and market reactions in price charts. No existing tool connects all three consistently. A standard LLM handles prose fluently but fails with numbers. A quant model handles numbers but is blind to the textual disclosures that explain them. Human analysts connect all three, but only for a handful of companies, with bias and inconsistency baked in.
FinReasoner is trained to close that gap.
Every training record enforces a single inference chain:
[Metric Change] → [Root Cause from Filing Text] → [Forward Implication]
This is not summarization. It is causal inference grounded in evidence — the mechanism connecting a number to a business outcome to an investment signal.
The model solves four concrete problems:
- Hallucination. Every causal claim in every training record is cross-validated against source metrics via the cross_ref_validated flag and dual-agent auditor. The model learns to say "cannot be determined from available data" rather than invent a driver.
- Narrative-metric disconnect. The model is trained to detect when management language contradicts what the income statement, balance sheet, and cash flow statement actually show: a Conflict Rationale that basic models miss entirely.
- Human bottleneck. A skilled analyst processes 5–10 filings per day. FinReasoner processes a 200-page 10-K in seconds with consistent logic, identical rigor, and no fatigue.
- Multi-period blindness. Trained on rolling multi-year windows with pre-computed CAGRs and margin trajectories, the model understands whether this quarter's compression is a new trend or a reversion, a question impossible to answer from a single document.
FinReasoner does not summarize filings. It audits corporate performance against market expectations and produces structured, evidence-constrained, causally grounded reasoning at a speed and scale no human team can match.
The model is trained never to:
- Fabricate metrics not present in the provided filing
- Silently estimate missing values
- Express false confidence on weak evidence
- Force conclusions when evidence is insufficient
When data is absent, or a value is derived rather than directly reported, the model says so explicitly:
"not disclosed"
"cannot be determined"
"evidence is inconclusive"
"estimated from available disclosures"
"calculated from reported values"
| Dimension | Value |
|---|---|
| Total companies | 580+ |
| S&P 500 constituents | 500 |
| Additional companies | 50+ |
| Temporal window | 5 years trailing |
| Filing types | 10-K, 10-Q, 10-Q/A |
| Total LLM calls in data pipeline | 40,000 |
| Final gold-standard records | 17,000 |
| Quality threshold | Score ≥ 80 |
| Component | Choice |
|---|---|
| Base model | Qwen 2.5 14B Instruct |
| Fine-tuning method | rsLoRA |
| Quantization | 4-bit NF4 |
| Compute dtype | BF16 |
| Layer strategy | Frozen embeddings + lower layers, adapted upper blocks |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Framework | Unsloth |
| Infrastructure | Lambda Labs A100 80GB |
FinReasoner/
│
├── phase1_training.py # SFT fine-tuning with rsLoRA
├── phase2_evaluation.py # Comprehensive benchmark evaluation
├── phase3_dpo.py # DPO alignment refinement + HF push
│
├── dataset_score80plus.jsonl # Filtered training dataset (score ≥ 80)
├── eval_set.jsonl # Held-out evaluation set
├── preference_pairs.jsonl # DPO preference pairs (auto-generated)
│
└── logs/
    ├── phase1_training.log
    ├── phase2_evaluation.log
    └── phase3_dpo.log
Phase 0 executes an institutional-grade data ingestion, harmonization, and semantic synthesis pipeline. Raw regulatory disclosures and market pricing data are transformed into a high-density, multi-year causal dataset through five distinct engineering stages.
The core database aggregates information from public corporate disclosures and external market data feeds over a strict trailing five-year temporal window.
Sources:
- SEC EDGAR: Form 10-K (annual) and Form 10-Q (quarterly) filings
- Yahoo Finance: Historical pricing data for abnormal return calculation
Universe:
- Full S&P 500 index
- 50+ additional companies selected for cross-industry variety and complex capital structures
Every record contains the complete accounting history of a firm alongside its associated real-world market valuation.
Raw SEC filings present significant variations in reporting methodology. Three critical normalization operations are applied:
Scaling Normalization
scaling_unit key normalizes all numbers reported in thousands,
millions, or billions into uniform absolute floats.
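A minimal sketch of the scaling step, assuming scaling_unit takes the string values "thousands", "millions", or "billions". The helper name `to_absolute` and the input figure are hypothetical and are not taken from the project's pipeline.

```python
# Multipliers for the scaling_unit values named above (assumed spellings).
UNIT_MULTIPLIERS = {"thousands": 1e3, "millions": 1e6, "billions": 1e9}


def to_absolute(value: float, scaling_unit: str) -> float:
    """Convert a figure reported in thousands/millions/billions to an absolute float."""
    return float(value) * UNIT_MULTIPLIERS[scaling_unit]


# Illustrative figure only, not from a real filing.
print(to_absolute(1250.5, "thousands"))  # 1250500.0
```

An unknown unit raises a KeyError here, which seems preferable to silently guessing a scale.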
Decumulation
decumulation_applied flag triggers discrete mathematical subtractions
to extract true standalone non-cumulative metrics for individual quarters.
Corporate issuers report quarterly numbers on a cumulative year-to-date
basis in later quarters. Without decumulation, Q3 figures would contain
Q1 + Q2 + Q3 data, not Q3 alone.
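The subtraction described above can be sketched as follows. This assumes the input is an ordered list of cumulative year-to-date values within a single fiscal year. The function name `decumulate` and the figures are hypothetical.

```python
def decumulate(ytd_values: list[float]) -> list[float]:
    """Turn cumulative year-to-date figures [Q1, Q1+Q2, ...] into standalone quarters."""
    standalone = []
    previous = 0.0
    for ytd in ytd_values:
        # Each standalone quarter is the YTD value minus the prior YTD value.
        standalone.append(ytd - previous)
        previous = ytd
    return standalone


# Hypothetical year-to-date revenue after Q1, Q2 and Q3.
print(decumulate([100.0, 220.0, 350.0]))  # [100.0, 120.0, 130.0]
```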
Year-over-Year Alignment
quarter_aligned_yoy validates that year-over-year quarterly comparisons
map precisely to equivalent operational windows of the prior fiscal period.
For every 10-K and 10-Q filing parsed, the ingestion engine:
- Records the exact filing date
- Defines a precise post-filing trading window via window_start, window_end, window_days
- Calculates total asset performance via stock_return_pct
- Cross-references against benchmark via spy_return_pct
- Isolates true alpha via abnormal_return_pct
abnormal_return_pct = stock_return_pct - spy_return_pct
This quantitative alpha metric is mapped to a categorical market_reaction token, enabling the model to evaluate financial performance in the context of real-world investor expectations.
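A short sketch of the abnormal-return calculation and its mapping to a categorical token. The subtraction follows the formula above. The 2.0 percentage-point threshold and the labels "positive" and "negative" are assumptions chosen for illustration, since only the "neutral" label appears in the project's sample record.

```python
def abnormal_return_pct(stock_return_pct: float, spy_return_pct: float) -> float:
    """Stock return over the window minus the SPY return over the same window."""
    return stock_return_pct - spy_return_pct


def market_reaction(abnormal_pct: float, threshold_pct: float = 2.0) -> str:
    """Bucket the abnormal return; threshold and non-neutral labels are assumed."""
    if abnormal_pct > threshold_pct:
        return "positive"
    if abnormal_pct < -threshold_pct:
        return "negative"
    return "neutral"


# Hypothetical window returns in percent.
alpha = abnormal_return_pct(3.5, 2.0)
print(alpha, market_reaction(alpha))  # 1.5 neutral
```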
An automated validation layer acts as a hard filter before any LLM processing occurs. Three primary control metrics govern this process:
| Metric | Function |
|---|---|
| sanity_flags_count | Records exact number of logical or arithmetic warnings |
| sanity_confidence | Assigns qualitative reliability ranking to parsed source |
| sanity_action | Gateway control: only pass records enter training |
Only records achieving sanity_action: pass proceed to the synthesis stage. This guarantees the learning system trains exclusively on mathematically sound corporate reporting.
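The gate can be sketched as a simple filter. This assumes sanity_action is stored under a record's meta object and that non-passing records carry some other value. The value "reject", the tickers, and the function name are hypothetical.

```python
def passes_sanity_gate(record: dict) -> bool:
    """Keep a record only when its meta.sanity_action is exactly "pass"."""
    return record.get("meta", {}).get("sanity_action") == "pass"


# Hypothetical parsed records; "reject" is an assumed non-pass value.
parsed = [
    {"id": "r1", "meta": {"sanity_action": "pass", "sanity_flags_count": 0}},
    {"id": "r2", "meta": {"sanity_action": "reject", "sanity_flags_count": 4}},
    {"id": "r3", "meta": {}},
]

to_synthesis = [r for r in parsed if passes_sanity_gate(r)]
print([r["id"] for r in to_synthesis])  # ['r1']
```

Records with a missing sanity_action are dropped rather than passed, which matches the hard-filter intent described above.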
The final training set is produced through an expansive multi-agent generator-auditor pipeline executing exactly 40,000 total LLM calls.
40,000 LLM calls
↓
Generator produces structured causal analysis
↓
Auditor cross-references every claim against raw financials
↓
cross_ref_validated → True / False
causal_density_score calculated
↓
Rejection or regeneration if evidence linkage insufficient
↓
17,000 gold-standard records
Generator responsibilities:
- Parse raw financial arrays and prose sections
- Build multi-layered longitudinal analysis
- Track changes over time with distinct financial signals
- Synthesize narrative instruction pairs with historical trends, segment performance, and capital allocation strategies
Auditor responsibilities:
- Cross-reference every text statement against raw financial numbers
- Verify every observation is anchored to accounting realities
- Set cross_ref_validated flag
- Verify data blend via data_source_mix
- Score multi-step reasoning via causal_density_score
- Trigger rejection or forced regeneration on shallow synthesis
Result: 17,000 elite high-fidelity records for causal financial intelligence.
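The generate, audit, and regenerate-or-reject flow above can be sketched as a small control loop. The two agents are passed in as plain callables. The stub agents, the `max_attempts` limit, and all field names other than cross_ref_validated are assumptions for illustration and do not reflect the project's actual prompts or scoring.

```python
from typing import Callable, Optional


def synthesize_record(
    raw: dict,
    generate: Callable[[dict], dict],
    audit: Callable[[dict, dict], bool],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Regenerate until the auditor validates a draft; return None on rejection."""
    for _ in range(max_attempts):
        draft = generate(raw)
        if audit(draft, raw):
            return {**draft, "cross_ref_validated": True}
    return None


# Stub agents standing in for the LLM calls (hypothetical).
drafts = iter([{"revenue_claim": 999.0}, {"revenue_claim": 120.0}])


def stub_generate(raw: dict) -> dict:
    return next(drafts)


def stub_audit(draft: dict, raw: dict) -> bool:
    # Accept only drafts whose claimed figure matches the raw financials.
    return draft["revenue_claim"] == raw["revenue"]


result = synthesize_record({"revenue": 120.0}, stub_generate, stub_audit)
print(result)  # {'revenue_claim': 120.0, 'cross_ref_validated': True}
```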
| Quality Score | Proportion |
|---|---|
| 100 | 60.7% |
| 90 | 28.7% |
| 85 | 3.9% |
| 80 | 6.1% |
| < 80 (excluded) | 0.6% |
All records scoring below 80 were excluded prior to training. Records below 80 exhibited LOW sanity_confidence, high causal_density on weak evidence, and minimal text grounding — a hallucination risk pattern.
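A minimal sketch of the score filter, assuming each JSONL line is one record and that the score lives at instruction_pair.metadata.quality_score. The function name and the two inline records are hypothetical.

```python
import json


def filter_by_quality(lines: list[str], min_score: int = 80) -> list[dict]:
    """Parse JSONL lines and keep records with quality_score >= min_score."""
    kept = []
    for line in lines:
        record = json.loads(line)
        score = record["instruction_pair"]["metadata"]["quality_score"]
        if score >= min_score:
            kept.append(record)
    return kept


# Two hypothetical JSONL lines, one above and one below the threshold.
raw_lines = [
    '{"id": "a", "instruction_pair": {"metadata": {"quality_score": 90}}}',
    '{"id": "b", "instruction_pair": {"metadata": {"quality_score": 70}}}',
]
print([r["id"] for r in filter_by_quality(raw_lines)])  # ['a']
```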
{
  "id": "e5e12426a61a",
  "meta": {
    "ticker": "A",
    "company_name": "Agilent Technologies Inc",
    "form_type": "10-K",
    "filed_date": "2022-12-21",
    "period_of_report": "2022-10-31",
    "fiscal_year": 2022,
    "abnormal_return": {
      "abnormal_return_pct": 1.388,
      "market_reaction": "neutral"
    },
    "sanity_action": "pass",
    "sanity_confidence": "high",
    "sanity_flags_count": 0,
    "scaling_unit": "millions"
  },
  "metrics": { "..." },
  "changes": { "..." },
  "signals": { "..." },
  "instruction_pair": {
    "instruction": "...",
    "input": "...",
    "output": "...",
    "metadata": {
      "quality_score": 90,
      "hallucination_flag": false,
      "causal_density_score": 17,
      "cross_ref_validated": true
    }
  }
}
Objective: Adapt the model to financial analytical reasoning without destroying its general reasoning priors.
| Parameter | Value |
|---|---|
| Method | rsLoRA |
| Epochs | 3 |
| Batch size (effective) | 16 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 5% |
| Gradient clipping | 1.0 |
| Sequence length | 4096 |
| Early stopping | Yes — best checkpoint saved |
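As a rough worked example, the table's values imply the following optimizer step counts. This assumes all 17,000 gold-standard records are used for training, which may not hold if part of the data is held out, and it rounds warmup up to a whole step as an illustrative choice.

```python
import math

num_examples = 17_000          # assumes every gold-standard record is trained on
effective_batch_size = 16
epochs = 3
warmup_ratio = 0.05

steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)  # 1063 3189 160
```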
What Phase 1 optimizes:
- Formatting fidelity
- Evidence grounding discipline
- Uncertainty language consistency
- Evidence-to-conclusion linkage
- Disciplined refusal on insufficient evidence
Output: fcyber/FinReasoner-qwen2.5-14b-instruct-phase1
Objective: Rigorously measure what the model actually learned.
| Benchmark | Target |
|---|---|
| Hallucination rate | < 5% |
| Uncertainty language | > 60% |
| Causal discipline | > 80% |
| Valid JSON format | > 95% |
| Evidence rich outputs | > 80% |
| Calibrated confidence | > 85% |
Decision rule:
All targets met → Deploy Phase 1 model. Skip Phase 3.
Any target fails → Fix dataset or proceed to Phase 3.
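The decision rule can be sketched as a check of measured rates against the benchmark targets. The metric key names and the sample results are hypothetical. The thresholds are the ones in the table, treated as strict inequalities.

```python
# (direction, bound): "max" means the value must stay below the bound.
TARGETS = {
    "hallucination_rate": ("max", 0.05),
    "uncertainty_language": ("min", 0.60),
    "causal_discipline": ("min", 0.80),
    "valid_json_format": ("min", 0.95),
    "evidence_rich_outputs": ("min", 0.80),
    "calibrated_confidence": ("min", 0.85),
}


def failed_targets(results: dict) -> list[str]:
    """Return the names of benchmarks whose target is not met."""
    failed = []
    for name, (direction, bound) in TARGETS.items():
        value = results[name]
        met = value < bound if direction == "max" else value > bound
        if not met:
            failed.append(name)
    return failed


# Hypothetical Phase 2 results.
results = {
    "hallucination_rate": 0.07,
    "uncertainty_language": 0.72,
    "causal_discipline": 0.88,
    "valid_json_format": 0.99,
    "evidence_rich_outputs": 0.83,
    "calibrated_confidence": 0.90,
}
failed = failed_targets(results)
decision = "deploy Phase 1" if not failed else "fix dataset or run Phase 3"
print(failed, decision)  # ['hallucination_rate'] fix dataset or run Phase 3
```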
Objective: Push model away from fabricated overconfident outputs toward grounded calibrated reasoning.
Only executed if Phase 2 reveals:
- Hallucination > 5%
- Calibration < 85%
- Uncertainty language < 60%
| Parameter | Value |
|---|---|
| Method | DPO |
| Epochs | 1 |
| Learning rate | 5e-5 |
| Beta | 0.1 |
| Preference pairs | Score ≥ 90 examples only |
Preference pair structure:
Chosen → grounded, calibrated, evidence-constrained output
Rejected → fabricated, overconfident, unsupported output
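One possible shape for a single preference pair is sketched below. The prompt/chosen/rejected key names follow the convention used by TRL's DPO datasets. Whether preference_pairs.jsonl uses exactly these keys is an assumption, and the example texts are invented.

```python
import json


def build_preference_pair(prompt: str, grounded: str, fabricated: str) -> dict:
    """Pair a grounded answer (chosen) with a fabricated one (rejected)."""
    return {"prompt": prompt, "chosen": grounded, "rejected": fabricated}


pair = build_preference_pair(
    prompt="What drove the change in gross margin?",
    grounded="The driver cannot be determined from available data.",
    fabricated="Margin fell because of a 12% rise in freight costs.",
)
# One JSON object per line, as in a .jsonl file.
print(json.dumps(pair))
```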
Output: fcyber/FinReasoner-qwen2.5-14b-instruct (final production)
| Benchmark | What It Measures |
|---|---|
| Hallucination rate | Claims not supported by input evidence |
| Unsupported metric rate | Metrics referenced but not in input |
| Uncertainty calibration | Confidence proportional to evidence strength |
| Contradiction handling | Correct behavior on conflicting evidence |
| Held-out company generalization | Reasoning transfer to unseen companies |
| Causal attribution legitimacy | Are causal claims licensed by evidence |
| Test | Description |
|---|---|
| Missing denominators | Ratio calculation with absent denominator |
| Swapped fiscal years | Mislabeled period detection |
| Restated metrics | Pre/post restatement handling |
| Misleading management language | Spin vs reported numbers |
| Corrupted tables | Missing cell detection |
| Incomplete quarterly disclosures | Partial period boundary enforcement |
| Intentionally ambiguous evidence | Forced refusal on unresolvable cases |
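As an illustration of the first stress test, a reference implementation of the expected behavior on a missing denominator could look like this. The function name is hypothetical, and the returned phrase is one of the model's stated uncertainty expressions.

```python
from typing import Optional, Union


def safe_ratio(
    numerator: Optional[float], denominator: Optional[float]
) -> Union[float, str]:
    """Compute a ratio, refusing when an input is absent or the denominator is zero."""
    if numerator is None or denominator is None or denominator == 0:
        return "cannot be determined"
    return numerator / denominator


print(safe_ratio(50.0, 200.0))  # 0.25
print(safe_ratio(50.0, None))   # cannot be determined
```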
EXL2 — optimized for fast token generation on A100
GGUF Q4_K_M — wide framework compatibility
from unsloth import FastLanguageModel
from huggingface_hub import login
import torch

# ── ONLY NEEDED IF REPO IS PRIVATE ──────────────────────
login()  # uses cached token from huggingface-cli login
# ────────────────────────────────────────────────────────

# ── LOAD MODEL (downloads automatically on first run) ────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "fcyber/FinReasoner-qwen2.5-14b-instruct",
    max_seq_length = 4096,
    dtype = torch.bfloat16,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

# ── INFERENCE ────────────────────────────────────────────
messages = [
    {"role": "system", "content": "You are an Elite Senior Financial Analyst..."},
    {"role": "user", "content": "COMPANY: Agilent Technologies..."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 1024,
    temperature = 0.1,
    do_sample = True,
    pad_token_id = tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)

All three phases produce structured logs via Loguru.
logs/phase1_training.log ← training loss, warnings, push status
logs/phase2_evaluation.log ← per-record hallucination, calibration flags
logs/phase3_dpo.log ← DPO training, preference pair build, push status
Log format:
2026-05-15 14:32:01 | INFO | Starting Phase 1 training
2026-05-15 14:32:45 | SUCCESS | rsLoRA applied successfully
2026-05-15 16:44:12 | WARNING | Hallucination detected | ticker=VTRS | invented_numbers=7
2026-05-15 18:01:33 | SUCCESS | Phase 1 model pushed to: https://huggingface.co/fcyber/...
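Lines in this format can be parsed with the standard library, for example to count hallucination warnings per ticker. The regular expression below is a sketch based only on the sample lines above and assumes that key=value fields contain no spaces.

```python
import re

# timestamp | LEVEL | message, as in the sample log lines.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<level>[A-Z]+)\s*\| (?P<message>.*)$"
)

line = (
    "2026-05-15 16:44:12 | WARNING | Hallucination detected "
    "| ticker=VTRS | invented_numbers=7"
)
match = LOG_LINE.match(line)
# Pull key=value fields out of the free-text message.
fields = dict(re.findall(r"(\w+)=(\S+)", match.group("message")))

print(match.group("level"), fields)
# WARNING {'ticker': 'VTRS', 'invented_numbers': '7'}
```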
Phase 3 worse than Phase 1 in evaluation?
→ Roll back to Phase 1 model
→ Push Phase 1 directly to production repo
→ fcyber/FinReasoner-qwen2.5-14b-instruct
Never overwrite a working model without evaluation confirmation.
Standard LoRA scales the adapter update by alpha / r, so the update shrinks as rank grows and learning becomes unstable at higher ranks. rsLoRA scales by alpha / sqrt(r) instead, which keeps gradient magnitudes stable regardless of rank. This is critical for analytical reasoning tasks that are sensitive to gradient noise.
rsLoRA adapters overfit faster than full fine-tuning due to limited parameter space. At 17,000 examples with 4096 sequence length, 3 epochs with early stopping provide sufficient signal without memorization. Setting load_best_model_at_end ensures the best checkpoint is always the one used.
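To make the scaling difference concrete: standard LoRA multiplies the adapter update by alpha / r, while rsLoRA uses alpha / sqrt(r). The sketch below compares the two factors at alpha = 32 for the rank used here (16) and for two larger ranks included only for comparison.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA scaling factor."""
    return alpha / math.sqrt(rank)


for rank in (16, 64, 256):
    print(rank, lora_scale(32, rank), rslora_scale(32, rank))
# 16 2.0 8.0
# 64 0.5 4.0
# 256 0.125 2.0
```

The standard factor falls 16x between rank 16 and rank 256, while the rsLoRA factor falls only 4x.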
The model already has five layers of implicit regularization: 4-bit NF4 quantization, gradient clipping, cosine scheduler, early stopping, and limited adapter rank. High dropout would kill the learning signal in a small adapter parameter space.
Conservative beta prevents over-refusal. Aggressive beta values push the model so hard away from rejected outputs that it begins refusing when it should answer. Start at 0.1, increase only if hallucination persists after Phase 3.
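For reference, beta enters the standard DPO objective as a multiplier on the log-probability margin between the chosen and rejected outputs, measured relative to a frozen reference model. The sketch below computes that per-pair loss from summed sequence log-probabilities. The numeric inputs are invented, and this is not the project's training code.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(x)) is the same as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# With no preference yet learned (margin = 0) the loss is log(2).
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 0.6931
# A policy that favors the chosen output gets a lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0) < math.log(2))  # True
```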
Dual GPU adds synchronization complexity, distributed instability, and debugging overhead without benefit at this scale. The rsLoRA + 4-bit configuration peaks at ~40–50GB VRAM, well within single A100 capacity.
pip install unsloth
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
pip install trl datasets transformers accelerate bitsandbytes loguru huggingface_hub
huggingface-cli login
# Token: https://huggingface.co/settings/tokens
@misc{finreasoner2024,
  author = {fcyber},
  title = {FinReasoner: Evidence-Grounded Financial Reasoning Engine},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/fcyber/FinReasoner-qwen2.5-14b-instruct}
}
This model is intended for research and institutional use. Base model license: Qwen 2.5 (see the Qwen License).