FinReasoner-qwen2.5-14b-instruct

Institutional-grade financial reasoning engine built on Qwen 2.5 14B Instruct. Fine-tuned on 17,000 gold-standard SEC filing analyses across 580+ companies over 5 years. Specialized for evidence-grounded causal inference, uncertainty calibration, and structured analytical reasoning.

Model Cards on HuggingFace

Version	Link
Phase 1 — SFT	fcyber/FinReasoner-qwen2.5-14b-instruct-phase1
Phase 3 — DPO	fcyber/FinReasoner-qwen2.5-14b-instruct-phase3
Final Production	fcyber/FinReasoner-qwen2.5-14b-instruct

What This Model Does

FinReasoner is NOT a general financial chatbot.

It is a financial analytical reasoning engine designed for:

Evidence-grounded causal analysis of SEC filings (10-K, 10-Q, 10-Q/A)
Structured multi-metric reasoning with explicit calculation transparency
Calibrated uncertainty expression when evidence is absent or conflicting
Longitudinal trend analysis across fiscal periods
Market reaction integration with abnormal return context
Segment-level performance decomposition

Core Behavioral Principles

Reliability > Creativity
Grounding   > Eloquence
Calibration > Verbosity
Evidence    > Inference

The Problem This Model Solves

Financial analysis lives across three silos that never talk to each other — numbers in spreadsheets, narratives in PDF filings, and market reactions in price charts. No existing tool connects all three consistently. A standard LLM handles prose fluently but fails with numbers. A quant model handles numbers but is blind to the textual disclosures that explain them. Human analysts connect all three, but only for a handful of companies, with bias and inconsistency baked in.

FinReasoner is trained to close that gap.

Every training record enforces a single inference chain:

[Metric Change] → [Root Cause from Filing Text] → [Forward Implication]

This is not summarization. It is causal inference grounded in evidence — the mechanism connecting a number to a business outcome to an investment signal.

The model solves four concrete problems:

Hallucination. Every causal claim in every training record is cross-validated against source metrics via the cross_ref_validated flag and dual-agent auditor. The model learns to say "cannot be determined from available data" rather than invent a driver.
Narrative-metric disconnect. The model is trained to detect when management language contradicts what the income statement, balance sheet, and cash flow statement actually show — a Conflict Rationale that basic models miss entirely.
Human bottleneck. A skilled analyst processes 5–10 filings per day. FinReasoner processes a 200-page 10-K in seconds with consistent logic, identical rigor, and no fatigue.
Multi-period blindness. Trained on rolling multi-year windows with pre-computed CAGRs and margin trajectories, the model understands whether this quarter's compression is a new trend or a reversion — a question impossible to answer from a single document.

FinReasoner does not summarize filings. It audits corporate performance against market expectations and produces structured, evidence-constrained, causally grounded reasoning at a speed and scale no human team can match.

What the Model Will NOT Do

Fabricate metrics not present in the provided filing
Silently estimate missing values
Express false confidence on weak evidence
Force conclusions when evidence is insufficient

When data is absent, or a value is derived rather than reported, the model says so explicitly:

"not disclosed"
"cannot be determined"
"evidence is inconclusive"
"estimated from available disclosures"
"calculated from reported values"

Universe Coverage

Dimension	Value
Total companies	580+
S&P 500 constituents	500
Additional companies	50+
Temporal window	5 years trailing
Filing types	10-K, 10-Q, 10-Q/A
Total LLM calls in data pipeline	40,000
Final gold-standard records	17,000
Quality threshold	Score ≥ 80

Architecture

Component	Choice
Base model	Qwen 2.5 14B Instruct
Fine-tuning method	rsLoRA
Quantization	4-bit NF4
Compute dtype	BF16
Layer strategy	Frozen embeddings + lower layers, adapted upper blocks
LoRA rank	16
LoRA alpha	32
LoRA dropout	0.05
Framework	Unsloth
Infrastructure	Lambda Labs A100 80GB

Project Structure

FinReasoner/
│
├── phase1_training.py          # SFT fine-tuning with rsLoRA
├── phase2_evaluation.py        # Comprehensive benchmark evaluation
├── phase3_dpo.py               # DPO alignment refinement + HF push
│
├── dataset_score80plus.jsonl   # Filtered training dataset (score ≥ 80)
├── eval_set.jsonl              # Held-out evaluation set
├── preference_pairs.jsonl      # DPO preference pairs (auto-generated)
│
└── logs/
    ├── phase1_training.log
    ├── phase2_evaluation.log
    └── phase3_dpo.log

Data Pipeline — Phase 0

Overview

Phase 0 executes an institutional-grade data ingestion, harmonization, and semantic synthesis pipeline. Raw regulatory disclosures and market pricing data are transformed into a high-density, multi-year causal dataset through five distinct engineering stages.

Stage 1 — Multi-Source Data Ingestion and Universe Alignment

The core database aggregates information from public corporate disclosures and external market data feeds over a strict trailing five-year temporal window.

Sources:

SEC EDGAR: Form 10-K (annual) and Form 10-Q (quarterly) filings
Yahoo Finance: Historical pricing data for abnormal return calculation

Universe:

Full S&P 500 index
50+ additional companies selected for cross-industry variety and complex capital structures

Every record contains the complete accounting history of a firm alongside its associated real-world market valuation.

Stage 2 — Accounting Normalization and Mathematical Adjustments

Raw SEC filings present significant variations in reporting methodology. Three critical normalization operations are applied:

Scaling Normalization

scaling_unit key normalizes all numbers reported in thousands,
millions, or billions into uniform absolute floats.
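
The scaling step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `to_absolute` and the `"units"` entry are assumptions, while `scaling_unit` and its thousands/millions/billions values come from the description above.

```python
# Multipliers for the scaling_unit values named above.
# The "units" entry is an assumption for filings already in absolute figures.
SCALE_FACTORS = {
    "units": 1.0,
    "thousands": 1e3,
    "millions": 1e6,
    "billions": 1e9,
}


def to_absolute(value: float, scaling_unit: str) -> float:
    """Convert a reported figure into an absolute float (hypothetical helper)."""
    try:
        return float(value) * SCALE_FACTORS[scaling_unit]
    except KeyError:
        # Fail loudly instead of guessing a scale for an unknown unit.
        raise ValueError(f"unknown scaling_unit: {scaling_unit!r}") from None


print(to_absolute(1.5, "billions"))   # 1500000000.0
print(to_absolute(250, "thousands"))  # 250000.0
```

Raising on an unknown unit, rather than defaulting to a multiplier of 1, matches the stated principle of never silently estimating values.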

Decumulation

decumulation_applied flag triggers discrete mathematical subtractions
to extract true standalone non-cumulative metrics for individual quarters.

Corporate issuers report quarterly numbers on a cumulative year-to-date
basis in later quarters. Without decumulation, Q3 figures would contain
Q1 + Q2 + Q3 data, not Q3 alone.
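
A minimal sketch of the subtraction, assuming the year-to-date values arrive in quarter order. The function name `decumulate` is hypothetical, and the figures are made up for illustration.

```python
def decumulate(ytd_values: list[float]) -> list[float]:
    """Convert cumulative year-to-date values (Q1, Q2, ...) into standalone quarters.

    Applies to flow metrics such as revenue or cash flow; point-in-time
    balance sheet items are not cumulative and need no adjustment.
    """
    standalone = []
    previous = 0.0
    for ytd in ytd_values:
        # Each standalone quarter is the YTD figure minus the prior YTD figure.
        standalone.append(ytd - previous)
        previous = ytd
    return standalone


# Illustrative YTD revenue: Q1 = 100, Q1+Q2 = 250, Q1+Q2+Q3 = 420
print(decumulate([100.0, 250.0, 420.0]))  # [100.0, 150.0, 170.0]
```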

Year-over-Year Alignment

quarter_aligned_yoy validates that year-over-year quarterly comparisons
map precisely to equivalent operational windows of the prior fiscal period.
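
One way to picture the alignment check is to key every standalone quarter by fiscal year and quarter, then compare only against the same quarter one year earlier. The sketch below assumes that keying scheme; `aligned_yoy_pct` is a hypothetical name, not a function from the repository.

```python
from typing import Optional


def aligned_yoy_pct(
    values: dict[tuple[int, int], float],
    fiscal_year: int,
    quarter: int,
) -> Optional[float]:
    """Year-over-year change in percent against the same fiscal quarter.

    Returns None when the matching prior-year quarter is missing or zero,
    so a misaligned comparison is never produced.
    """
    current = values.get((fiscal_year, quarter))
    prior = values.get((fiscal_year - 1, quarter))
    if current is None or prior is None or prior == 0:
        return None
    return (current - prior) / abs(prior) * 100.0


# Illustrative standalone quarterly revenue keyed by (fiscal_year, quarter)
revenue = {(2021, 3): 200.0, (2022, 2): 210.0, (2022, 3): 250.0}
print(aligned_yoy_pct(revenue, 2022, 3))  # 25.0
print(aligned_yoy_pct(revenue, 2022, 2))  # None (no Q2 2021 to compare with)
```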

Stage 3 — Market Reaction and Alternative Data Splicing

For every 10-K and 10-Q filing parsed, the ingestion engine:

Records the exact filing date
Defines a precise post-filing trading window via window_start, window_end, window_days
Calculates total asset performance via stock_return_pct
Cross-references against benchmark via spy_return_pct
Isolates true alpha via abnormal_return_pct

abnormal_return_pct = stock_return_pct - spy_return_pct

This quantitative alpha metric is mapped to a categorical market_reaction token, enabling the model to evaluate financial performance in the context of real-world investor expectations.
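
The calculation and the mapping to a token can be sketched as below. The subtraction follows the formula above. The 2.0 percentage-point threshold and the `"positive"` / `"negative"` labels are assumptions made for illustration; only `"neutral"` appears in the example record.

```python
def abnormal_return_pct(stock_return_pct: float, spy_return_pct: float) -> float:
    """Stock return over the post-filing window minus the SPY return."""
    return stock_return_pct - spy_return_pct


def market_reaction(abnormal_pct: float, threshold_pct: float = 2.0) -> str:
    """Map an abnormal return to a categorical token.

    threshold_pct and the non-neutral labels are hypothetical; the real
    pipeline's cut-offs are not stated in this README.
    """
    if abnormal_pct > threshold_pct:
        return "positive"
    if abnormal_pct < -threshold_pct:
        return "negative"
    return "neutral"


alpha = abnormal_return_pct(stock_return_pct=5.0, spy_return_pct=1.5)
print(alpha, market_reaction(alpha))  # 3.5 positive
print(market_reaction(1.388))         # neutral
```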

Stage 4 — Automated Sanity Filter and Structural Guardrails

An automated validation layer acts as a hard filter before any LLM processing occurs. Three primary control metrics govern this process:

Metric	Function
`sanity_flags_count`	Records exact number of logical or arithmetic warnings
`sanity_confidence`	Assigns qualitative reliability ranking to parsed source
`sanity_action`	Gateway control — only `pass` records enter training

Only records achieving sanity_action: pass proceed to the synthesis stage. This guarantees the learning system trains exclusively on mathematically sound corporate reporting.
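
As a rough sketch, the gate reduces to a single equality check on the record's `meta` block. The layout follows the example record shown later in this README. The helper name and the non-passing value `"reject"` are assumptions.

```python
def passes_sanity_gate(record: dict) -> bool:
    """True only when the record's meta block carries sanity_action == "pass"."""
    return record.get("meta", {}).get("sanity_action") == "pass"


records = [
    {"id": "a", "meta": {"sanity_action": "pass", "sanity_flags_count": 0}},
    {"id": "b", "meta": {"sanity_action": "reject", "sanity_flags_count": 4}},
    {"id": "c", "meta": {}},  # missing field: treated as not passing
]
kept = [r["id"] for r in records if passes_sanity_gate(r)]
print(kept)  # ['a']
```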

Stage 5 — Multi-Agent Synthesis and Critique Loop

The final training set is produced through an expansive multi-agent generator-auditor pipeline executing exactly 40,000 total LLM calls.

40,000 LLM calls
        ↓
Generator produces structured causal analysis
        ↓
Auditor cross-references every claim against raw financials
        ↓
cross_ref_validated → True / False
causal_density_score calculated
        ↓
Rejection or regeneration if evidence linkage insufficient
        ↓
17,000 gold-standard records

Generator responsibilities:

Parse raw financial arrays and prose sections
Build multi-layered longitudinal analysis
Track changes over time with distinct financial signals
Synthesize narrative instruction pairs with historical trends, segment performance, and capital allocation strategies

Auditor responsibilities:

Cross-reference every text statement against raw financial numbers
Verify every observation is anchored to accounting realities
Set cross_ref_validated flag
Verify data blend via data_source_mix
Score multi-step reasoning via causal_density_score
Trigger rejection or forced regeneration on shallow synthesis

Result: 17,000 elite high-fidelity records for causal financial intelligence.
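
The control flow of the loop can be sketched with the generator and auditor passed in as plain callables. This is an illustration of the reject-or-regenerate pattern only. The function name `synthesize`, the `max_attempts` limit, and the stub agents are assumptions, and no LLM is called.

```python
from typing import Callable, Optional


def synthesize(
    filing: dict,
    generate: Callable[[dict], dict],
    audit: Callable[[dict, dict], bool],
    max_attempts: int = 3,  # hypothetical retry budget
) -> Optional[dict]:
    """Regenerate until the auditor validates the analysis, else reject (None)."""
    for _ in range(max_attempts):
        candidate = generate(filing)
        if audit(filing, candidate):
            candidate["cross_ref_validated"] = True
            return candidate
    return None


# Stub agents standing in for the LLM generator and auditor.
attempts = []


def stub_generate(filing: dict) -> dict:
    attempts.append(1)
    # First draft is wrong on purpose so the retry path is exercised.
    claim = "revenue fell" if len(attempts) == 1 else "revenue rose"
    return {"claim": claim}


def stub_audit(filing: dict, candidate: dict) -> bool:
    return candidate["claim"] == filing["truth"]


result = synthesize({"truth": "revenue rose"}, stub_generate, stub_audit)
print(result)  # {'claim': 'revenue rose', 'cross_ref_validated': True}
```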

Dataset Quality Distribution (Post-Filter)

Quality Score	Proportion
100	60.7%
90	28.7%
85	3.9%
80	6.1%
< 80 (excluded)	0.6%

All records scoring below 80 were excluded before training. These records combined low sanity_confidence, a high causal_density_score built on weak evidence, and minimal text grounding, a pattern that signals hallucination risk.

Example Record Structure

{
  "id": "e5e12426a61a",
  "meta": {
    "ticker": "A",
    "company_name": "Agilent Technologies Inc",
    "form_type": "10-K",
    "filed_date": "2022-12-21",
    "period_of_report": "2022-10-31",
    "fiscal_year": 2022,
    "abnormal_return": {
      "abnormal_return_pct": 1.388,
      "market_reaction": "neutral"
    },
    "sanity_action": "pass",
    "sanity_confidence": "high",
    "sanity_flags_count": 0,
    "scaling_unit": "millions"
  },
  "metrics": { "..." },
  "changes": { "..." },
  "signals": { "..." },
  "instruction_pair": {
    "instruction": "...",
    "input": "...",
    "output": "...",
    "metadata": {
      "quality_score": 90,
      "hallucination_flag": false,
      "causal_density_score": 17,
      "cross_ref_validated": true
    }
  }
}

Training Pipeline

Phase 1 — Supervised Fine-Tuning (SFT)

Objective: Adapt the model to financial analytical reasoning without destroying its general reasoning priors.

Parameter	Value
Method	rsLoRA
Epochs	3
Batch size (effective)	16
Learning rate	1e-4
LR scheduler	Cosine
Warmup ratio	5%
Gradient clipping	1.0
Sequence length	4096
Early stopping	Yes — best checkpoint saved

What Phase 1 optimizes:

Formatting fidelity
Evidence grounding discipline
Uncertainty language consistency
Evidence-to-conclusion linkage
Disciplined refusal on insufficient evidence

Output: fcyber/FinReasoner-qwen2.5-14b-instruct-phase1

Phase 2 — Comprehensive Evaluation

Objective: Rigorously measure what the model actually learned.

Benchmark	Target
Hallucination rate	< 5%
Uncertainty language	> 60%
Causal discipline	> 80%
Valid JSON format	> 95%
Evidence rich outputs	> 80%
Calibrated confidence	> 85%

Decision rule:

All targets met  → Deploy Phase 1 model. Skip Phase 3.
Any target fails → Fix dataset or proceed to Phase 3.
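
The decision rule can be written as a small gate over the benchmark table. The thresholds are taken from the table above. The metric key names and function names are hypothetical, and rates are expressed as fractions.

```python
# (comparison, threshold) per benchmark, from the Phase 2 target table.
TARGETS = {
    "hallucination_rate": ("<", 0.05),
    "uncertainty_language": (">", 0.60),
    "causal_discipline": (">", 0.80),
    "valid_json_format": (">", 0.95),
    "evidence_rich_outputs": (">", 0.80),
    "calibrated_confidence": (">", 0.85),
}


def failed_targets(results: dict[str, float]) -> list[str]:
    """Names of benchmarks whose measured value misses its target."""
    failed = []
    for name, (op, threshold) in TARGETS.items():
        value = results[name]
        ok = value < threshold if op == "<" else value > threshold
        if not ok:
            failed.append(name)
    return failed


def next_step(results: dict[str, float]) -> str:
    """Apply the decision rule: deploy Phase 1 only if every target is met."""
    return "deploy_phase1" if not failed_targets(results) else "fix_dataset_or_run_phase3"


# Illustrative numbers only, not measured results.
example = {
    "hallucination_rate": 0.08,
    "uncertainty_language": 0.70,
    "causal_discipline": 0.90,
    "valid_json_format": 0.99,
    "evidence_rich_outputs": 0.85,
    "calibrated_confidence": 0.90,
}
print(failed_targets(example), next_step(example))
```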

Phase 3 — DPO Alignment Refinement

Objective: Push the model away from fabricated, overconfident outputs and toward grounded, calibrated reasoning.

Executed only if Phase 2 reveals any of:

Hallucination > 5%
Calibration < 85%
Uncertainty language < 60%

Parameter	Value
Method	DPO
Epochs	1
Learning rate	5e-5
Beta	0.1
Preference pairs	Score ≥ 90 examples only

Preference pair structure:

Chosen   → grounded, calibrated, evidence-constrained output
Rejected → fabricated, overconfident, unsupported output
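
A single line of preference_pairs.jsonl might look like the sketch below. The `prompt` / `chosen` / `rejected` keys are the column names TRL's DPO trainer expects. The helper name and the example texts are made up for illustration and are not taken from the dataset.

```python
import json


def make_preference_pair(prompt: str, chosen: str, rejected: str) -> dict:
    """Build one DPO record; identical completions carry no preference signal."""
    if chosen == rejected:
        raise ValueError("chosen and rejected outputs must differ")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


pair = make_preference_pair(
    prompt="Why did operating margin change? (segment data not provided)",
    chosen="The driver cannot be determined: segment costs are not disclosed.",
    rejected="Margin fell because of a 12% rise in logistics costs.",
)
jsonl_line = json.dumps(pair)  # one JSON object per line in the .jsonl file
print(jsonl_line)
```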

Output: fcyber/FinReasoner-qwen2.5-14b-instruct (final production)

Evaluation Benchmarks

Core Benchmarks

Benchmark	What It Measures
Hallucination rate	Claims not supported by input evidence
Unsupported metric rate	Metrics referenced but not in input
Uncertainty calibration	Confidence proportional to evidence strength
Contradiction handling	Correct behavior on conflicting evidence
Held-out company generalization	Reasoning transfer to unseen companies
Causal attribution legitimacy	Are causal claims licensed by evidence

Adversarial Stress Tests

Test	Description
Missing denominators	Ratio calculation with absent denominator
Swapped fiscal years	Mislabeled period detection
Restated metrics	Pre/post restatement handling
Misleading management language	Spin vs reported numbers
Corrupted tables	Missing cell detection
Incomplete quarterly disclosures	Partial period boundary enforcement
Intentionally ambiguous evidence	Forced refusal on unresolvable cases
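
For the missing-denominator test, the expected behavior can be expressed as a reference function that a test harness could compare model output against. This is a sketch of the expected answer, not of the model's internals; `safe_ratio` is a hypothetical name, and the returned phrases reuse the uncertainty wording listed earlier.

```python
from typing import Optional, Union


def safe_ratio(
    numerator: Optional[float],
    denominator: Optional[float],
) -> Union[float, str]:
    """Return the ratio, or an explicit uncertainty phrase when it is undefined."""
    if numerator is None or denominator is None:
        return "not disclosed"
    if denominator == 0:
        return "cannot be determined"
    return numerator / denominator


print(safe_ratio(50.0, 200.0))  # 0.25
print(safe_ratio(50.0, None))   # not disclosed
print(safe_ratio(50.0, 0.0))    # cannot be determined
```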

Inference

Preferred (Production)

EXL2 — optimized for fast token generation on A100

Fallback

GGUF Q4_K_M — wide framework compatibility

Example Usage

from unsloth import FastLanguageModel
from huggingface_hub import login
import torch

# ── ONLY NEEDED IF REPO IS PRIVATE AND NO TOKEN IS CACHED ──
# A token saved by `huggingface-cli login` is picked up automatically.
# login()   # uncomment to enter a token interactively
# ────────────────────────────────────────────────────────

# ── LOAD MODEL (downloads automatically on first run) ────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "fcyber/FinReasoner-qwen2.5-14b-instruct",
    max_seq_length = 4096,
    dtype          = torch.bfloat16,
    load_in_4bit   = True,
)
FastLanguageModel.for_inference(model)

# ── INFERENCE ────────────────────────────────────────────
messages = [
    {"role": "system",  "content": "You are an Elite Senior Financial Analyst..."},
    {"role": "user",    "content": "COMPANY: Agilent Technologies..."},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize              = False,
    add_generation_prompt = True,
)

inputs  = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 1024,
    temperature    = 0.1,
    do_sample      = True,
    pad_token_id   = tokenizer.eos_token_id,
)

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)

Logging

All three phases produce structured logs via Loguru.

logs/phase1_training.log    ← training loss, warnings, push status
logs/phase2_evaluation.log  ← per-record hallucination, calibration flags
logs/phase3_dpo.log         ← DPO training, preference pair build, push status

Log format:

2026-05-15 14:32:01 | INFO    | Starting Phase 1 training
2026-05-15 14:32:45 | SUCCESS | rsLoRA applied successfully
2026-05-15 16:44:12 | WARNING | Hallucination detected | ticker=VTRS | invented_numbers=7
2026-05-15 18:01:33 | SUCCESS | Phase 1 model pushed to: https://huggingface.co/fcyber/...
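
Lines in this format are easy to parse for later analysis, for example to collect every hallucination warning from a run. The sketch below uses only the standard library. The regular expression is inferred from the sample lines above, and `parse_log_line` is a hypothetical helper.

```python
import re
from typing import Optional

# timestamp | LEVEL (space-padded) | message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<level>\w+)\s*\| "
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Split one log line into timestamp, level and message; None if malformed."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


sample = (
    "2026-05-15 16:44:12 | WARNING | Hallucination detected "
    "| ticker=VTRS | invented_numbers=7"
)
entry = parse_log_line(sample)
print(entry["level"], "->", entry["message"])
```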

Rollback Strategy

Phase 3 worse than Phase 1 in evaluation?
→ Roll back to Phase 1 model
→ Push Phase 1 directly to production repo
→ fcyber/FinReasoner-qwen2.5-14b-instruct

Never overwrite a working model without evaluation confirmation.

Key Design Decisions

- rsLoRA over standard LoRA

Standard LoRA has unstable gradient scaling at higher ranks. rsLoRA applies mathematically correct scaling that produces stable gradients regardless of rank — critical for analytical reasoning tasks sensitive to gradient noise.
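
The difference between the two methods comes down to the factor applied to the adapter update: standard LoRA scales by alpha / r, while rsLoRA scales by alpha / sqrt(r). The sketch below plugs in the rank and alpha from the Architecture table. The function names are illustrative only.

```python
import math


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA: the update is scaled by alpha / r."""
    return alpha / rank


def rslora_scaling(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA: the update is scaled by alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)


alpha, rank = 32, 16  # values from the Architecture table
print(lora_scaling(alpha, rank))    # 2.0
print(rslora_scaling(alpha, rank))  # 8.0

# The standard factor shrinks as 1/r when rank grows; the rsLoRA factor
# shrinks only as 1/sqrt(r).
for r in (8, 16, 64):
    print(r, lora_scaling(alpha, r), rslora_scaling(alpha, r))
```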

- 3 epochs not 5+

rsLoRA adapters overfit faster than full fine-tuning due to limited parameter space. At 17,000 examples with a 4096-token sequence length, 3 epochs with early stopping provide sufficient signal without memorization. Setting load_best_model_at_end ensures the best checkpoint is the one loaded when training ends.

- lora_dropout = 0.05

The model already has five layers of implicit regularization: 4-bit NF4 quantization, gradient clipping, cosine scheduler, early stopping, and limited adapter rank. High dropout would kill the learning signal in a small adapter parameter space.

- DPO beta = 0.1

Conservative beta prevents over-refusal. Aggressive beta values push the model so hard away from rejected outputs that it begins refusing when it should answer. Start at 0.1, increase only if hallucination persists after Phase 3.

- single A100 80GB not dual

Dual GPU adds synchronization complexity, distributed instability, and debugging overhead without benefit at this scale. The rsLoRA + 4-bit configuration peaks at ~40–50GB VRAM, well within single A100 capacity.

Dependencies

pip install unsloth
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
pip install trl datasets transformers accelerate bitsandbytes loguru huggingface_hub

HuggingFace Login

huggingface-cli login
# Token: https://huggingface.co/settings/tokens

Citation

@misc{finreasoner2024,
  author    = {fcyber},
  title     = {FinReasoner: Evidence-Grounded Financial Reasoning Engine},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/fcyber/FinReasoner-qwen2.5-14b-instruct}
}

License

This model is intended for research and institutional use. Base model license: Qwen 2.5 — see Qwen License

Author

fcyber
