Updated Model #18
FinAI-Core v2.2 Architecture Overview

This is the complete, final architecture for the ultra-efficient, unlocked, continual-learning financial language model we've designed. It's optimized for training from scratch on free GitHub Actions CPU runners in 2–4 weeks, with ~700M total parameters (~350M active per token via MoE sparsity; see the sketch after the component list below). Model Type: Decoder-only causal language model (Hugging Face).

1. Tokenizer
2. Embedding Layer
3. Positional Encoding
4. Layer Stack (20 Layers Total)
5. Feed-Forward / MoE Component (Per Layer)
6. Multi-Token Prediction (MTP) Head
7. Output Layer
8. Continual Learning Mechanisms (Built-In)
9. Training / Inference Optimizations
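To make the sparsity figure concrete, here is a minimal PyTorch sketch of the per-layer MoE feed-forward (component 5 above). This is an illustration, not the project's code: the class name `MoEFeedForward` and the chosen dimensions, expert count, and top-k value are all assumed placeholders. Each token is routed to its top-k experts, so only a fraction of the FFN parameters fire per token, which is how ~700M total parameters can cost roughly half that per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Hypothetical sketch of the per-layer MoE FFN (component 5).

    Each token is routed to top_k of n_experts expert MLPs, so only a
    fraction of the total FFN parameters is active per token -- the
    mechanism behind "~350M active of ~700M total".
    """
    def __init__(self, d_model=1024, d_ff=2816, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.SiLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = self.router(x)                # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over chosen experts
        out = torch.zeros_like(x)
        # Plain loops for clarity; real implementations batch by expert.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

In the full 20-layer stack, a block like this would stand in for the dense FFN in each layer; a production version would also add a load-balancing auxiliary loss, omitted here for brevity.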
This architecture balances cutting-edge 2026 innovations (DeepSeek DSA/MLA/MoE/MTP + a Mamba-2 hybrid) with extreme efficiency for your constraints. It will deliver strong finance-specific performance (report analysis, forecasting, compliance, quant reasoning) while remaining lightweight and continually evolving. And this time with better error handling!
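To illustrate the MTP head (component 6), here is a hedged sketch of the general technique: several small output heads each predict a token further into the future, densifying the training signal per batch, which matters on a tight CPU budget. The name `MTPHead`, the parameter `n_future`, and the shapes are assumptions for illustration, not the model's actual implementation (DeepSeek-style MTP uses an extra transformer block per prediction depth rather than bare linear heads).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """Hypothetical sketch of a multi-token-prediction head (component 6).

    On top of the final hidden states, n_future small heads each predict
    the token n steps ahead; the extra losses densify the training signal.
    """
    def __init__(self, d_model, vocab_size, n_future=2):
        super().__init__()
        self.n_future = n_future
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(n_future)
        )

    def forward(self, hidden, targets):
        # hidden: (batch, seq, d_model); targets: (batch, seq) token ids
        loss = hidden.new_zeros(())
        for n, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-n, :])           # predict token t+n
            shifted = targets[:, n:]                   # the token n steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), shifted.reshape(-1)
            )
        return loss / self.n_future
```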
We are updating our model since the previous one had multiple issues.
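Since v2.2 promises better error handling, here is an illustrative sketch of the kind of checkpoint-and-resume training step that keeps a multi-week run alive across time-limited GitHub Actions jobs. Every identifier here (`train_resumable`, `ckpt_path`, the batch format) is hypothetical and only demonstrates the pattern, not the project's actual code.

```python
import os
import torch

def train_resumable(model, optimizer, data_loader, ckpt_path="ckpt.pt",
                    save_every=500):
    """Illustrative resumable training loop (not the project's actual code).

    Saves a checkpoint every `save_every` steps so a run killed by the
    runner's time limit can pick up where it left off, and skips batches
    that raise instead of crashing the whole job.
    """
    step = 0
    if os.path.exists(ckpt_path):                     # resume if interrupted
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        step = state["step"]
        # (fast-forwarding the data loader to `step` is omitted in this sketch)

    for batch in data_loader:
        try:
            optimizer.zero_grad()
            loss = model(**batch)                     # assumed to return a loss
            loss.backward()
            optimizer.step()
        except RuntimeError as err:                   # e.g. OOM on a long batch
            print(f"step {step}: skipping batch ({err})")
            continue
        step += 1
        if step % save_every == 0:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, ckpt_path)
```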