fcyber-labs/FinReasoner

FinReasoner-qwen2.5-14b-instruct

Institutional-grade financial reasoning engine built on Qwen 2.5 14B Instruct. Fine-tuned on 17,000 gold-standard SEC filing analyses across 580+ companies over 5 years. Specialized for evidence-grounded causal inference, uncertainty calibration, and structured analytical reasoning.


Model Cards on HuggingFace

| Version | Link |
|---|---|
| Phase 1 (SFT) | fcyber/FinReasoner-qwen2.5-14b-instruct-phase1 |
| Phase 3 (DPO) | fcyber/FinReasoner-qwen2.5-14b-instruct-phase3 |
| Final Production | fcyber/FinReasoner-qwen2.5-14b-instruct |

What This Model Does

FinReasoner is NOT a general financial chatbot.

It is a financial analytical reasoning engine designed for:

  • Evidence-grounded causal analysis of SEC filings (10-K, 10-Q, 10-Q/A)
  • Structured multi-metric reasoning with explicit calculation transparency
  • Calibrated uncertainty expression when evidence is absent or conflicting
  • Longitudinal trend analysis across fiscal periods
  • Market reaction integration with abnormal return context
  • Segment-level performance decomposition

Core Behavioral Principles

Reliability > Creativity
Grounding   > Eloquence
Calibration > Verbosity
Evidence    > Inference

The Problem This Model Solves

Financial analysis lives across three silos that never talk to each other — numbers in spreadsheets, narratives in PDF filings, and market reactions in price charts. No existing tool connects all three consistently. A standard LLM handles prose fluently but fails with numbers. A quant model handles numbers but is blind to the textual disclosures that explain them. Human analysts connect all three, but only for a handful of companies, with bias and inconsistency baked in.

FinReasoner is trained to close that gap.

Every training record enforces a single inference chain:

[Metric Change] → [Root Cause from Filing Text] → [Forward Implication]

This is not summarization. It is causal inference grounded in evidence — the mechanism connecting a number to a business outcome to an investment signal.

The model solves four concrete problems:

  • Hallucination. Every causal claim in every training record is cross-validated against source metrics via the cross_ref_validated flag and dual-agent auditor. The model learns to say "cannot be determined from available data" rather than invent a driver.

  • Narrative-metric disconnect. The model is trained to detect when management language contradicts what the income statement, balance sheet, and cash flow statement actually show — a Conflict Rationale that basic models miss entirely.

  • Human bottleneck. A skilled analyst processes 5–10 filings per day. FinReasoner processes a 200-page 10-K in seconds with consistent logic, identical rigor, and no fatigue.

  • Multi-period blindness. Trained on rolling multi-year windows with pre-computed CAGRs and margin trajectories, the model understands whether this quarter's compression is a new trend or a reversion — a question impossible to answer from a single document.
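The pre-computed trend features mentioned in the last point can be illustrated with a minimal sketch. The function names and input layout here are assumptions for illustration; the actual pipeline code is not shown in this README.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or end_value < 0 or years <= 0:
        raise ValueError("CAGR needs a positive start value, non-negative end value, and positive span")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def margin_trajectory(revenue: list[float], operating_income: list[float]) -> list[float]:
    """Operating margin per period, in the order the periods are given."""
    return [income / rev for rev, income in zip(revenue, operating_income)]


# Revenue growing from 100 to 121 over two years is a 10% CAGR.
growth = cagr(100.0, 121.0, 2)
margins = margin_trajectory([100.0, 110.0, 121.0], [20.0, 21.0, 22.0])
```

Comparing the latest margin against the earlier entries of the trajectory is what lets a reader (or a model) tell a new trend from a reversion.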

FinReasoner does not summarize filings. It audits corporate performance against market expectations and produces structured, evidence-constrained, causally grounded reasoning at a speed and scale no human team can match.


What the Model Will NOT Do

  • Fabricate metrics not present in the provided filing
  • Silently estimate missing values
  • Express false confidence on weak evidence
  • Force conclusions when evidence is insufficient

When data is absent or a value is derived rather than reported, the model says so explicitly, using phrases such as:

"not disclosed"
"cannot be determined"
"evidence is inconclusive"
"estimated from available disclosures"
"calculated from reported values"

Universe Coverage

| Dimension | Value |
|---|---|
| Total companies | 580+ |
| S&P 500 constituents | 500 |
| Additional companies | 50+ |
| Temporal window | 5 years trailing |
| Filing types | 10-K, 10-Q, 10-Q/A |
| Total LLM calls in data pipeline | 40,000 |
| Final gold-standard records | 17,000 |
| Quality threshold | Score ≥ 80 |

Architecture

| Component | Choice |
|---|---|
| Base model | Qwen 2.5 14B Instruct |
| Fine-tuning method | rsLoRA |
| Quantization | 4-bit NF4 |
| Compute dtype | BF16 |
| Layer strategy | Frozen embeddings + lower layers, adapted upper blocks |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Framework | Unsloth |
| Infrastructure | Lambda Labs A100 80GB |

Project Structure

FinReasoner/
│
├── phase1_training.py          # SFT fine-tuning with rsLoRA
├── phase2_evaluation.py        # Comprehensive benchmark evaluation
├── phase3_dpo.py               # DPO alignment refinement + HF push
│
├── dataset_score80plus.jsonl   # Filtered training dataset (score ≥ 80)
├── eval_set.jsonl              # Held-out evaluation set
├── preference_pairs.jsonl      # DPO preference pairs (auto-generated)
│
└── logs/
    ├── phase1_training.log
    ├── phase2_evaluation.log
    └── phase3_dpo.log

Data Pipeline — Phase 0

Overview

Phase 0 executes an institutional-grade data ingestion, harmonization, and semantic synthesis pipeline. Raw regulatory disclosures and market pricing data are transformed into a high-density, multi-year causal dataset through five distinct engineering stages.


Stage 1 — Multi-Source Data Ingestion and Universe Alignment

The core database aggregates information from public corporate disclosures and external market data feeds over a strict trailing five-year temporal window.

Sources:

  • SEC EDGAR: Form 10-K (annual) and Form 10-Q (quarterly) filings
  • Yahoo Finance: Historical pricing data for abnormal return calculation

Universe:

  • Full S&P 500 index
  • 50+ additional companies selected for cross-industry variety and complex capital structures

Every record contains the complete accounting history of a firm alongside its associated real-world market valuation.


Stage 2 — Accounting Normalization and Mathematical Adjustments

Raw SEC filings present significant variations in reporting methodology. Three critical normalization operations are applied:

Scaling Normalization

scaling_unit key normalizes all numbers reported in thousands,
millions, or billions into uniform absolute floats.
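
A minimal sketch of this normalization, assuming scaling_unit takes the string values shown below (only "millions" appears in the example record; the other keys and the helper name are illustrative):

```python
# Multipliers for each reporting unit. Keys other than "millions" are assumed.
SCALE_FACTORS = {
    "units": 1.0,
    "thousands": 1e3,
    "millions": 1e6,
    "billions": 1e9,
}


def to_absolute(value: float, scaling_unit: str) -> float:
    """Convert a reported figure into an absolute float."""
    try:
        factor = SCALE_FACTORS[scaling_unit.lower()]
    except KeyError:
        raise ValueError(f"unknown scaling_unit: {scaling_unit!r}") from None
    return float(value) * factor


# A figure reported as 6,848 in a filing scaled in millions.
absolute = to_absolute(6848, "millions")
```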

Decumulation

decumulation_applied flag triggers discrete mathematical subtractions
to extract true standalone non-cumulative metrics for individual quarters.

Corporate issuers report quarterly numbers on a cumulative year-to-date
basis in later quarters. Without decumulation, Q3 figures would contain
Q1 + Q2 + Q3 data, not Q3 alone.
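
The subtraction can be sketched as follows. This is an illustration of the idea, not the pipeline's actual code; the function name and list layout are assumptions.

```python
def decumulate(ytd_values: list[float]) -> list[float]:
    """Turn cumulative year-to-date figures (Q1, Q1+Q2, ...) into standalone quarters."""
    standalone = []
    previous = 0.0
    for cumulative in ytd_values:
        # Each quarter is the YTD total minus the YTD total one quarter earlier.
        standalone.append(cumulative - previous)
        previous = cumulative
    return standalone


# YTD revenue of 100, 250, 420 means the quarters were 100, 150, 170.
quarters = decumulate([100.0, 250.0, 420.0])
```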

Year-over-Year Alignment

quarter_aligned_yoy validates that year-over-year quarterly comparisons
map precisely to equivalent operational windows of the prior fiscal period.

Stage 3 — Market Reaction and Alternative Data Splicing

For every 10-K and 10-Q filing parsed, the ingestion engine:

  1. Records the exact filing date
  2. Defines a precise post-filing trading window via window_start, window_end, window_days
  3. Calculates total asset performance via stock_return_pct
  4. Cross-references against benchmark via spy_return_pct
  5. Isolates true alpha via abnormal_return_pct

abnormal_return_pct = stock_return_pct - spy_return_pct

This quantitative alpha metric is mapped to a categorical market_reaction token, enabling the model to evaluate financial performance in the context of real-world investor expectations.
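
As a sketch of this step: the subtraction follows the formula above, but the ±2 percentage point threshold and the "positive" / "negative" labels are assumptions. The README only shows a "neutral" example (abnormal_return_pct of 1.388) and does not state the actual cut-offs.

```python
def abnormal_return_pct(stock_return_pct: float, spy_return_pct: float) -> float:
    """Stock return over the post-filing window minus the SPY return over the same window."""
    return stock_return_pct - spy_return_pct


def market_reaction(abnormal_pct: float, threshold_pct: float = 2.0) -> str:
    """Map the abnormal return to a categorical token (threshold is hypothetical)."""
    if abnormal_pct > threshold_pct:
        return "positive"
    if abnormal_pct < -threshold_pct:
        return "negative"
    return "neutral"


reaction = market_reaction(abnormal_return_pct(3.5, 2.112))
```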


Stage 4 — Automated Sanity Filter and Structural Guardrails

An automated validation layer acts as a hard filter before any LLM processing occurs. Three primary control metrics govern this process:

| Metric | Function |
|---|---|
| sanity_flags_count | Records exact number of logical or arithmetic warnings |
| sanity_confidence | Assigns qualitative reliability ranking to parsed source |
| sanity_action | Gateway control: only pass records enter training |

Only records achieving sanity_action: pass proceed to the synthesis stage. This guarantees the learning system trains exclusively on mathematically sound corporate reporting.
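
A minimal sketch of the gate, assuming records are stored one JSON object per line with sanity_action under meta, as in the example record further down. The function names are illustrative.

```python
import json
from typing import Iterable, Iterator


def passes_sanity_gate(record: dict) -> bool:
    """True only when the record's sanity_action is exactly "pass"."""
    return record.get("meta", {}).get("sanity_action") == "pass"


def filter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines and yield only records that clear the gate."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if passes_sanity_gate(record):
            yield record


sample = [
    '{"id": "a", "meta": {"sanity_action": "pass"}}',
    '{"id": "b", "meta": {"sanity_action": "reject"}}',
]
kept_ids = [r["id"] for r in filter_records(sample)]
```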


Stage 5 — Multi-Agent Synthesis and Critique Loop

The final training set is produced through an expansive multi-agent generator-auditor pipeline executing exactly 40,000 total LLM calls.

40,000 LLM calls
        ↓
Generator produces structured causal analysis
        ↓
Auditor cross-references every claim against raw financials
        ↓
cross_ref_validated → True / False
causal_density_score calculated
        ↓
Rejection or regeneration if evidence linkage insufficient
        ↓
17,000 gold-standard records

Generator responsibilities:

  • Parse raw financial arrays and prose sections
  • Build multi-layered longitudinal analysis
  • Track changes over time with distinct financial signals
  • Synthesize narrative instruction pairs with historical trends, segment performance, and capital allocation strategies

Auditor responsibilities:

  • Cross-reference every text statement against raw financial numbers
  • Verify every observation is anchored to accounting realities
  • Set cross_ref_validated flag
  • Verify data blend via data_source_mix
  • Score multi-step reasoning via causal_density_score
  • Trigger rejection or forced regeneration on shallow synthesis

Result: 17,000 elite high-fidelity records for causal financial intelligence.
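
The control flow of the loop can be sketched as below. The generate and audit callables stand in for the LLM-backed agents; the retry limit and minimum causal_density_score are hypothetical, since the README does not state them.

```python
from typing import Callable, Optional


def synthesize(
    record: dict,
    generate: Callable[[dict], str],
    audit: Callable[[dict, str], dict],
    max_attempts: int = 3,          # hypothetical retry budget
    min_causal_density: int = 1,    # hypothetical acceptance threshold
) -> Optional[dict]:
    """Regenerate until the auditor validates the analysis, or give up."""
    for _ in range(max_attempts):
        analysis = generate(record)
        verdict = audit(record, analysis)
        accepted = (
            verdict["cross_ref_validated"]
            and verdict["causal_density_score"] >= min_causal_density
        )
        if accepted:
            return {"output": analysis, "metadata": verdict}
    return None  # rejected: evidence linkage never became sufficient
```

Returning None rather than the last draft keeps unvalidated text out of the training set by construction.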


Dataset Quality Distribution (Post-Filter)

| Quality Score | Proportion |
|---|---|
| 100 | 60.7% |
| 90 | 28.7% |
| 85 | 3.9% |
| 80 | 6.1% |
| < 80 (excluded) | 0.6% |

All records scoring below 80 were excluded prior to training. Records below 80 exhibited LOW sanity_confidence, high causal_density on weak evidence, and minimal text grounding — a hallucination risk pattern.
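
A sketch of the score filter and the distribution summary, assuming quality_score lives under instruction_pair.metadata as in the example record. The helper names are illustrative.

```python
from collections import Counter


def quality_score(record: dict) -> int:
    return record["instruction_pair"]["metadata"]["quality_score"]


def keep_high_quality(records: list[dict], threshold: int = 80) -> list[dict]:
    """Keep records whose quality score meets the threshold."""
    return [r for r in records if quality_score(r) >= threshold]


def score_distribution(records: list[dict]) -> dict[int, float]:
    """Percentage of records at each quality score, sorted by score."""
    counts = Counter(quality_score(r) for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {score: 100.0 * n / total for score, n in sorted(counts.items())}


def _record(score: int) -> dict:
    # Tiny stand-in record carrying only the field this sketch reads.
    return {"instruction_pair": {"metadata": {"quality_score": score}}}


records = [_record(100), _record(100), _record(90), _record(70)]
training_set = keep_high_quality(records)
```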


Example Record Structure

{
  "id": "e5e12426a61a",
  "meta": {
    "ticker": "A",
    "company_name": "Agilent Technologies Inc",
    "form_type": "10-K",
    "filed_date": "2022-12-21",
    "period_of_report": "2022-10-31",
    "fiscal_year": 2022,
    "abnormal_return": {
      "abnormal_return_pct": 1.388,
      "market_reaction": "neutral"
    },
    "sanity_action": "pass",
    "sanity_confidence": "high",
    "sanity_flags_count": 0,
    "scaling_unit": "millions"
  },
  "metrics": { "..." },
  "changes": { "..." },
  "signals": { "..." },
  "instruction_pair": {
    "instruction": "...",
    "input": "...",
    "output": "...",
    "metadata": {
      "quality_score": 90,
      "hallucination_flag": false,
      "causal_density_score": 17,
      "cross_ref_validated": true
    }
  }
}

Training Pipeline

Phase 1 — Supervised Fine-Tuning (SFT)

Objective: Adapt the model to financial analytical reasoning without destroying its general reasoning priors.

| Parameter | Value |
|---|---|
| Method | rsLoRA |
| Epochs | 3 |
| Batch size (effective) | 16 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 5% |
| Gradient clipping | 1.0 |
| Sequence length | 4096 |
| Early stopping | Yes (best checkpoint saved) |
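
For orientation, the table can be written out as a plain dictionary using Hugging Face TrainingArguments-style field names. This is a sketch, not the contents of phase1_training.py: the split of the effective batch size into per-device batch and accumulation steps is an assumption, as is the use of these exact keys.

```python
# Phase 1 settings from the table, keyed by TrainingArguments-style names.
PHASE1_HYPERPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,   # assumption: 2 x 8 = effective 16
    "gradient_accumulation_steps": 8,   # assumption: see above
    "learning_rate": 1e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "max_grad_norm": 1.0,               # gradient clipping
    "bf16": True,                       # BF16 compute dtype
    "load_best_model_at_end": True,     # keep the best checkpoint
}
MAX_SEQ_LENGTH = 4096


def effective_batch_size(params: dict, num_gpus: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return (
        params["per_device_train_batch_size"]
        * params["gradient_accumulation_steps"]
        * num_gpus
    )
```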

What Phase 1 optimizes:

  • Formatting fidelity
  • Evidence grounding discipline
  • Uncertainty language consistency
  • Evidence-to-conclusion linkage
  • Disciplined refusal on insufficient evidence

Output: fcyber/FinReasoner-qwen2.5-14b-instruct-phase1


Phase 2 — Comprehensive Evaluation

Objective: Rigorously measure what the model actually learned.

| Benchmark | Target |
|---|---|
| Hallucination rate | < 5% |
| Uncertainty language | > 60% |
| Causal discipline | > 80% |
| Valid JSON format | > 95% |
| Evidence rich outputs | > 80% |
| Calibrated confidence | > 85% |

Decision rule:

All targets met  → Deploy Phase 1 model. Skip Phase 3.
Any target fails → Fix dataset or proceed to Phase 3.
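
The decision rule can be sketched as a small check over the benchmark results. The thresholds come from the table above; the dictionary keys and return strings are hypothetical names chosen for this illustration.

```python
import operator

# Benchmark name -> (comparison that must hold, threshold in percent).
TARGETS = {
    "hallucination_rate": (operator.lt, 5.0),
    "uncertainty_language": (operator.gt, 60.0),
    "causal_discipline": (operator.gt, 80.0),
    "valid_json_format": (operator.gt, 95.0),
    "evidence_rich_outputs": (operator.gt, 80.0),
    "calibrated_confidence": (operator.gt, 85.0),
}


def failed_targets(results: dict) -> list[str]:
    """Names of benchmarks whose measured value misses its target."""
    return [
        name
        for name, (compare, threshold) in TARGETS.items()
        if not compare(results[name], threshold)
    ]


def next_step(results: dict) -> str:
    """Apply the rule: deploy if everything passes, otherwise fix data or run DPO."""
    return "deploy_phase1" if not failed_targets(results) else "fix_dataset_or_run_phase3"
```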

Phase 3 — DPO Alignment Refinement

Objective: Push model away from fabricated overconfident outputs toward grounded calibrated reasoning.

Only executed if Phase 2 reveals:

  • Hallucination > 5%
  • Calibration < 85%
  • Uncertainty language < 60%

| Parameter | Value |
|---|---|
| Method | DPO |
| Epochs | 1 |
| Learning rate | 5e-5 |
| Beta | 0.1 |
| Preference pairs | Score ≥ 90 examples only |

Preference pair structure:

Chosen   → grounded, calibrated, evidence-constrained output
Rejected → fabricated, overconfident, unsupported output
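
A sketch of how one pair might be assembled from a gold record. The prompt / chosen / rejected layout is the one commonly used by preference-tuning libraries such as TRL; how the rejected output is produced is not described in this README, so it is passed in as an argument here. The function name and prompt concatenation are assumptions.

```python
from typing import Optional


def build_preference_pair(
    record: dict,
    rejected_output: str,
    min_score: int = 90,
) -> Optional[dict]:
    """Return a DPO-style pair, or None if the record scores below min_score."""
    pair = record["instruction_pair"]
    if pair["metadata"]["quality_score"] < min_score:
        return None
    # Assumption: instruction and input are simply joined to form the prompt.
    prompt = f"{pair['instruction']}\n\n{pair['input']}"
    return {
        "prompt": prompt,
        "chosen": pair["output"],      # grounded, calibrated gold output
        "rejected": rejected_output,   # fabricated or overconfident output
    }
```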

Output: fcyber/FinReasoner-qwen2.5-14b-instruct (final production)


Evaluation Benchmarks

Core Benchmarks

| Benchmark | What It Measures |
|---|---|
| Hallucination rate | Claims not supported by input evidence |
| Unsupported metric rate | Metrics referenced but not in input |
| Uncertainty calibration | Confidence proportional to evidence strength |
| Contradiction handling | Correct behavior on conflicting evidence |
| Held-out company generalization | Reasoning transfer to unseen companies |
| Causal attribution legitimacy | Are causal claims licensed by evidence |

Adversarial Stress Tests

| Test | Description |
|---|---|
| Missing denominators | Ratio calculation with absent denominator |
| Swapped fiscal years | Mislabeled period detection |
| Restated metrics | Pre/post restatement handling |
| Misleading management language | Spin vs reported numbers |
| Corrupted tables | Missing cell detection |
| Incomplete quarterly disclosures | Partial period boundary enforcement |
| Intentionally ambiguous evidence | Forced refusal on unresolvable cases |

Inference

Preferred (Production)

EXL2 — optimized for fast token generation on A100

Fallback

GGUF Q4_K_M — wide framework compatibility

Example Usage

from unsloth import FastLanguageModel
from huggingface_hub import login
import torch

# ── ONLY NEEDED IF REPO IS PRIVATE ──────────────────────
login()   # uses cached token from huggingface-cli login
# ────────────────────────────────────────────────────────

# ── LOAD MODEL (downloads automatically on first run) ────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "fcyber/FinReasoner-qwen2.5-14b-instruct",
    max_seq_length = 4096,
    dtype          = torch.bfloat16,
    load_in_4bit   = True,
)
FastLanguageModel.for_inference(model)

# ── INFERENCE ────────────────────────────────────────────
messages = [
    {"role": "system",  "content": "You are an Elite Senior Financial Analyst..."},
    {"role": "user",    "content": "COMPANY: Agilent Technologies..."},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize              = False,
    add_generation_prompt = True,
)

inputs  = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 1024,
    temperature    = 0.1,
    do_sample      = True,
    pad_token_id   = tokenizer.eos_token_id,
)

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)

Logging

All three phases produce structured logs via Loguru.

logs/phase1_training.log    ← training loss, warnings, push status
logs/phase2_evaluation.log  ← per-record hallucination, calibration flags
logs/phase3_dpo.log         ← DPO training, preference pair build, push status

Log format:

2026-05-15 14:32:01 | INFO    | Starting Phase 1 training
2026-05-15 14:32:45 | SUCCESS | rsLoRA applied successfully
2026-05-15 16:44:12 | WARNING | Hallucination detected | ticker=VTRS | invented_numbers=7
2026-05-15 18:01:33 | SUCCESS | Phase 1 model pushed to: https://huggingface.co/fcyber/...

Rollback Strategy

Phase 3 worse than Phase 1 in evaluation?
→ Roll back to Phase 1 model
→ Push Phase 1 directly to production repo
→ fcyber/FinReasoner-qwen2.5-14b-instruct

Never overwrite a working model without evaluation confirmation.


Key Design Decisions

- rsLoRA over standard LoRA

Standard LoRA scales the adapter update by alpha / r, so the effective update shrinks as rank grows and training becomes unstable at higher ranks. rsLoRA scales by alpha / sqrt(r) instead, which keeps gradient magnitudes stable regardless of rank. This matters for analytical reasoning tasks that are sensitive to gradient noise.
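
The difference comes down to the factor applied to the adapter update: alpha / r for standard LoRA versus alpha / sqrt(r) for rsLoRA. A short sketch with this project's rank and alpha (the function names are illustrative):

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)


# With LoRA rank 16 and alpha 32, as configured in the Architecture table.
standard = lora_scale(32, 16)
stabilized = rslora_scale(32, 16)
```

With these settings the standard factor is 2.0 and the rank-stabilized factor is 8.0. The gap widens as rank increases, because alpha / r decays faster than alpha / sqrt(r).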

- 3 epochs not 5+

rsLoRA adapters overfit faster than full fine-tuning due to limited parameter space. At 17k examples with 4096 sequence length, 3 epochs with early stopping provides sufficient signal without memorization. Early stopping via load_best_model_at_end ensures the best checkpoint is always used.

- lora_dropout = 0.05

The model already has five layers of implicit regularization: 4-bit NF4 quantization, gradient clipping, cosine scheduler, early stopping, and limited adapter rank. High dropout would kill the learning signal in a small adapter parameter space.

- DPO beta = 0.1

Conservative beta prevents over-refusal. Aggressive beta values push the model so hard away from rejected outputs that it begins refusing when it should answer. Start at 0.1, increase only if hallucination persists after Phase 3.

- single A100 80GB not dual

Dual GPU adds synchronization complexity, distributed instability, and debugging overhead without benefit at this scale. The rsLoRA + 4-bit configuration peaks at ~40–50GB VRAM, well within single A100 capacity.


Dependencies

pip install unsloth
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
pip install trl datasets transformers accelerate bitsandbytes loguru huggingface_hub

HuggingFace Login

huggingface-cli login
# Token: https://huggingface.co/settings/tokens

Citation

@misc{finreasoner2024,
  author    = {fcyber},
  title     = {FinReasoner: Evidence-Grounded Financial Reasoning Engine},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/fcyber/FinReasoner-qwen2.5-14b-instruct}
}

License

This model is intended for research and institutional use. Base model license: Qwen 2.5 (see the Qwen License).


Author

fcyber

About

A fine-tuned LLM for causal financial reasoning on SEC filings (10-K, 10-Q), integrating structured metrics, textual disclosures, and market signals to replicate analyst-level insights at scale.
