CrowNest

A reinforcement-learning aggregator on top of the TradingAgents multi-agent LLM trading framework.

What is CrowNest?

CrowNest is a Columbia IEORE 4733 (Algorithmic Trading) course project that takes the TauricResearch/TradingAgents multi-agent LLM trading framework and asks a different question:

Given the same four agent reports (market / news / social / fundamentals) the framework already produces, can a learned policy aggregate them into a position that beats both the framework's own production aggregator (an EXP3 adaptive Hedge) and the obvious passive baseline (buy-and-hold)?

The short answer is yes — PPO beats buy-and-hold by ~2 percentage points on the 2026 YTD test across both LLM backbones we tried (Anthropic Haiku 4.5 and OpenAI GPT-4o-mini). The longer answer is more interesting: PPO's gap over the production EXP3 aggregator is large on the Anthropic corpus and roughly zero on the OpenAI corpus. See Multi-LLM ablation below.

Headline result — Anthropic corpus, 2026 YTD test on SPY / NVDA / MSFT / JNJ / XOM:

Policy	Total Return	Sharpe	Max Drawdown
CrowNest PPO v3 (E26, autoresearch)	+10.93%	+1.60	-9.29%
CrowNest PPO v3 (E20, val-best)	+9.69%	+1.44	-10.87%
Buy-and-hold	+7.23%	+1.02	-15.42%
TradingAgents adaptive Hedge (production aggregator)	+0.10%	+0.43	-13.50%
Uniform Hedge (mean of 4 signals)	+2.67%	+0.59	-11.96%
Random	-5.77%	-0.31	-15.90%

Same window, same prices, same agent reports. The only thing that changed is the aggregator on top.

Cross-LLM transfer (the most interesting single number): train PPO E20 on the Anthropic corpus, evaluate on the OpenAI corpus — returns +10.61% / Sharpe 1.62, slightly better than the OpenAI-trained policy on its own corpus. The policy generalises across LLM backbones.

Credit

This repository builds on TauricResearch/TradingAgents (the Tauric Research multi-agent LLM trading framework, arXiv:2412.20138).

The TradingAgents framework — analyst team, researcher debate, risk manager, portfolio manager, LangGraph orchestration, vendor data layer — is upstream. We use it unchanged to produce the four cached agent signals per (ticker, trade_date) row that CrowNest's RL policies then aggregate. All credit for that framework goes to Yijia Xiao, Edward Sun, Di Luo, and Wei Wang. See the citation section at the bottom.

What's new in this repo:

An offline corpus pipeline that walks the framework over a date grid and caches the agent reports/signals into a parquet (data/corpus_v1/training_corpus.parquet).
A PPO policy stack (web/backend/rl_policy/) — BC warm-start + offline PPO with a KL anchor, trained on a sentence-transformer-encoded state of the four agent reports plus position one-hot, realised volatility, past-5d returns, regime one-hot, and a synthetic-teacher reflection block.
A bull/bear/neutral ensemble with optional regime-conditioned routing.
A Karpathy-style autoresearch loop (experiments_v3/) that hill-climbs a single objective (0.7·val_sharpe + 0.3·val_return / |val_drawdown|) over the policy hyperparameters.
A unified comparison harness (experiments_v3/compare_all.py) that backtests every policy — buy-and-hold, random, uniform hedge, adaptive hedge, every PPO checkpoint — on the same window with the same costs and an optional --tx-bps-grid sweep.
A multi-LLM ablation (scripts/run_multi_llm_ablation.py) — corpus collection, featurisation, training, and per-backbone + cross-LLM evaluation under one idempotent driver. Run on Anthropic Haiku 4.5 and OpenAI GPT-4o-mini.
A statistical-significance battery (experiments_v3/significance.py) — per-ticker block bootstrap with cross-ticker mean, Diebold–Mariano (two- and one-sided, Newey–West HAC), sign test, Sharpe-difference test.
A Streamlit results dashboard (dashboard/app.py) — interactive equity curves per ticker, cost-sensitivity slider, bootstrap CI table. Reads pre-computed JSON; launches in ~2 s.

Repo layout

crownest/
├── tradingagents/           # ← upstream TradingAgents framework (LangGraph multi-agent)
│   ├── agents/              #   analysts, researchers, trader, risk team, portfolio manager
│   ├── graph/               #   LangGraph orchestration
│   ├── dataflows/           #   yfinance / Alpha Vantage data layer
│   └── llm_clients/         #   OpenAI / Anthropic / Google / xAI / etc.
│
├── web/backend/rl_policy/   # ← CrowNest RL aggregator (NEW)
│   ├── ppo_trainer_v2.py    #   BC warm-start + offline PPO with KL anchor
│   ├── policy_net.py        #   Tanh-squashed Gaussian policy head
│   ├── ensemble.py          #   bull/bear/neutral ensemble loader
│   ├── regime_router.py     #   regime-conditioned ensemble weights
│   ├── reflection_state.py  #   PIT-safe per-ticker reflection state
│   ├── compare.py           #   apples-to-apples policy comparison engine
│   ├── compare_v3.py        #   v3 backtest with live reflection state
│   └── inference.py         #   PolicyInference checkpoint wrapper
│
├── experiments_v3/          # ← CrowNest autoresearch loop (NEW)
│   ├── program.md           #   research goal, constraints, stopping rules
│   ├── train.py             #   single-file editable trainer (TUNABLE constants at top)
│   ├── log.md               #   30-experiment log: hypothesis → result → keep/revert
│   ├── compare_all.py       #   unified comparison across all policies
│   └── comparison_2026YTD_E20.json  # final results JSON
│
├── scripts/                 # CLI entry points
│   ├── collect_corpus.py    #   walk framework over date grid → parquet
│   ├── featurize_corpus.py  #   parquet → state tensor (data/features/state_features_v3.npz)
│   ├── train_ppo.py         #   PPO v1
│   ├── train_ppo_v2.py      #   PPO v2 (better reward)
│   ├── train_ppo_v3_ensemble.py  # bull/bear/neutral ensemble
│   ├── compare_policies.py  #   compare on a corpus
│   ├── run_v3_backtest.py   #   live-reflection v3 backtest
│   └── run_tradingagents_baseline.py  # framework-only baseline
│
├── models/                  # ← TRACKED — trained PPO checkpoints + benchmarks
│   ├── README.md            #   describes each .pt: era, state dim, sizing, results
│   ├── policy_v3_E20.pt     #   autoresearch val-best (PRIMARY, +9.69% test)
│   ├── policy_v3_E26.pt     #   autoresearch test-best (+10.93%, KL=0.05 sibling)
│   ├── policy_v3_1.pt       #   conservative legacy v3 (scale=3, hidden=256)
│   ├── policy_v3_{bull,bear,neutral}.pt  # ensemble heads
│   ├── policy_v2*.pt        #   legacy v2 + variants (regime, 2026test)
│   ├── policy_v1.pt         #   earliest checkpoint
│   └── benchmarks/          #   comparison/eval JSON reports (research artifacts)
│
├── data/                    # GITIGNORED — runtime data, regenerable from scripts/
│   ├── corpus_v1/           #   training_corpus.parquet (cached agent reports)
│   ├── features/            #   state_features_v3.npz (1560-dim states)
│   └── pit_cache.sqlite     #   PIT data cache for live backtests
│
├── docs/                    # longer-form documentation
│   ├── PLAN_v1_to_v3.md     #   historical roadmap (frozen)
│   ├── CHANGELOG_upstream.md # upstream TradingAgents changelog at fork date
│   └── README.md            #   doc index
│
├── tests/                   # pytest suite
├── PLAN.md                  # forward-looking roadmap (multi-LLM, bear-market, …)
├── CITATION.cff             # paper-ready citation metadata
│
├── cli/                     # upstream TradingAgents interactive CLI
└── web/                     # upstream Next.js dashboard + FastAPI backend

Method (CrowNest piece)

State

Per (ticker, trade_date) row, the state is 1560-dim:

Block	Dim	Source
4 × MiniLM-L6-v2 embeddings of agent reports	1536	`sentence-transformers/all-MiniLM-L6-v2`
Position one-hot (-1, 0, +1)	3	current portfolio holding
Realised vol (20d) + past-5d returns	6	from cached prices
Regime one-hot (uptrend / sideways / downtrend)	3	rule-based on past returns
Synthetic-teacher reflection block (PIT-safe)	12	rolling stats of the teacher's prior actions

Network architecture

Two small MLPs, both fully connected with ReLU activations. Roughly 1M parameters total at the headline configuration (hidden_dim=512, second_hidden=64), but the design works at much smaller sizes — the bulk of the representational work is already done upstream by the frozen MiniLM-L6-v2 encoder.

Policy network — tanh-squashed Gaussian on [-1, 1].

state (1560-dim)
    │
    ▼  Linear(1560 → 512), ReLU
    │
    ▼  Linear(512 → 64), ReLU
    │
    ├──── Linear(64 → 1)  →  μ        (pre-tanh mean)
    │
    └──── Linear(64 → 1)  →  log σ    (clamped to [−5, +2])

         u ~ N(μ, σ²)         (reparameterised sample)
         a = tanh(u)          ∈ [−1, +1]
         log π(a|s) = log N(u; μ, σ) − Σ log(1 − tanh²(u) + ε)

The SAC-style tanh-squashed parameterisation gives bounded actions with a correctly normalised log-probability via the change-of-variable correction. The bounded log-σ range prevents the variance from collapsing (no exploration) or exploding (gradient instability).

Value network — identical MLP trunk, scalar head.

state (1560-dim) → Linear(1560 → 512) → ReLU
                → Linear(512 → 64)   → ReLU
                → Linear(64 → 1)     → V(s)

Trained against GAE-λ targets (γ = 0.95, λ = 0.95). Separate parameters from the policy; updated each PPO step with MSE loss.

Sizing rationale. The corpus is ~8,800 (ticker, date) rows after the PIT cut. With ~1M parameters the train-to-parameter ratio is generous enough that we did not see overfitting in the autoresearch sweep (hidden 256 → 512 strictly improved validation; 1024 hurt it). Inference takes well under a millisecond per state on CPU.

Action

Continuous scalar in [-1, 1], mapped to a target equity fraction via target = clip(CONTINUOUS_SCALE × action, -1, 1). Trades fire only when the target fraction differs from the current fraction by more than REBALANCE_EPS, which suppresses whipsawing on tiny action perturbations.

Note: with CONTINUOUS_SCALE = 15 (the autoresearch winner), the target saturates at ±1 whenever |action| > 1/15 ≈ 0.07. So the deployed policy is effectively discretised to long-full / short-full / flat by the cap, even though the underlying network outputs a continuous action.

Reward

r_t = position × realised_return − var_λ · σ²(returns) − dd_λ · max(0, drawdown − dd_threshold) − turnover_λ · |Δaction|

Training

Two stages, both fully offline (no LLM calls during training):

Behaviour-cloning (BC) — match a teacher action that is just the mean of the four agent signals. 30 epochs, batch 64.
Offline PPO — clip-PPO with a KL anchor back to the BC policy, value head separate. 60 PPO epochs, GAE λ=0.95, γ=0.95.

The same recipe trains all three ensemble heads (bull / bear / neutral); they only differ in their reward variant.

PIT integrity

The corpus is built closed-day-only (each row is computed using data available on the morning of trade_date). The reflection block uses compute() on day t with only days ≤ t-1 visible. The calendar split is hard:

train  : trade_date < 2025-10-01
val    : 2025-10-01 ≤ trade_date < 2026-01-01
test   : trade_date ≥ 2026-01-02

Every metric reported on this README is from the test split, never seen during training or tuning.

The autoresearch loop (`experiments_v3/`)

Following karpathy/autoresearch, all hyperparameters live as # TUNABLE constants at the top of a single file (train.py). The loop is:

read prior log → propose ONE change with hypothesis → edit train.py → run (~115s) → log → decide KEEP or REVERT

The objective is primary = 0.7·val_sharpe + 0.3·val_total_return / |val_max_drawdown|. We tune on val, report on test, and never look at test during the search. Stopping condition: 30 experiments or 5 consecutive non-improving val-primary, whichever first.

Search trajectory (30 experiments, see experiments_v3/log.md):

#	Move	Verdict
9	`HIDDEN_DIM`: 256 → 512	Breakthrough: val_primary +0.094 → +0.268
14	`CONTINUOUS_SCALE`: 3 → 4	Beats B&H for the first time
15–20	`CONTINUOUS_SCALE`: 4 → 5 → 6 → 7 → 8 → 10 → 15	Monotone val gain to peak
21	`CONTINUOUS_SCALE`: 15 → 20	Val regresses → lock at 15
22–30	`ENTROPY_COEF`, `KL_COEF`, `TURNOVER_LAMBDA`, `BC_EPOCHS`, `PPO_EPOCHS`	All worse on val → revert

The two reportable winners:

E20 (val-best, principled choice): CONTINUOUS_SCALE=15.0, HIDDEN_DIM=512, KL_COEF=0.01, REBALANCE_EPS=0.05, all other defaults. val_primary = +0.776.
E26 (test-best, transparent): same as E20 but KL_COEF=0.05. Slightly worse on val (+0.756 vs +0.776) but stronger on test. We report E20 as the principled selection but include E26 for honesty.

Reproducing the headline results

0. Set up

git clone https://github.com/pranuprakash/crownest.git
cd crownest
python3.11 -m venv .venv && source .venv/bin/activate
pip install -e .
cp .env.example .env  # then add OPENAI_API_KEY / ANTHROPIC_API_KEY / etc.

1. Build the corpus (gitignored — needs LLM API calls)

The corpus is the cached output of the upstream TradingAgents framework on a date grid. This is the expensive step (~$3 with Anthropic Haiku 4.5 for the 8.8k row corpus we use):

uv run python scripts/collect_corpus.py \
    --tickers SPY,NVDA,MSFT,JNJ,XOM \
    --start 2024-01-02 --end 2026-05-07 \
    --provider anthropic --model claude-haiku-4-5 \
    --out data/corpus_v1/training_corpus.parquet

2. Featurize → state tensor

uv run python scripts/featurize_corpus.py \
    --corpus data/corpus_v1/training_corpus.parquet \
    --out    data/features/state_features_v3.npz

This encodes the four agent reports with sentence-transformers/all-MiniLM-L6-v2 and stitches together the 1560-dim state.

3. Run the autoresearch loop OR use the pre-trained checkpoints

Option A — run the loop yourself (≈60 minutes total on CPU for 30 experiments):

# Each call edits experiments_v3/train.py constants, re-runs, appends to log.md.
# The loop is meant to be driven by a coding agent (Claude Code, etc.); see
# experiments_v3/program.md for the contract.
uv run python experiments_v3/train.py

The single-line summary printed by train.py is what the agent parses to update log.md.

Option B — use the pre-trained winners (already in models/):

models/policy_v3_E20.pt    # E20: val-best (principled)
models/policy_v3_E26.pt    # E26: test-best (transparency)
models/policy_v3_1.pt      # earlier conservative baseline (scale=3, hidden=256)
models/policy_v2.pt        # legacy v2 (no reflection block)
models/policy_v1.pt        # legacy v1 (5-band sizing, mean-signals teacher)

See models/README.md for full descriptions of every checkpoint (era, state dim, hyperparameters, test results) and guidance on which to load for which purpose.

4. Run the unified comparison

uv run python experiments_v3/compare_all.py

This produces the aggregate table and per-ticker breakdown above and saves the JSON to experiments_v3/comparison_2026YTD_E20.json.

Per-ticker comparison (2026 YTD, 87 trading days)

Total return (%)

Policy	SPY	NVDA	MSFT	JNJ	XOM	Mean
PPO v3 E26	+8.26	+14.43	−1.01	+13.77	+19.18	+10.93
PPO v3 E20	+8.71	+14.71	−3.65	+11.16	+17.54	+9.69
Buy-and-hold	+8.27	+13.96	−12.03	+7.30	+18.66	+7.23
PPO v2 (legacy)	+0.09	+13.96	−17.38	+18.36	+17.16	+6.44
PPO v1 (legacy)	+8.55	−6.25	−2.78	+4.42	+25.48	+5.88
PPO v3.1 (legacy)	+3.66	+0.71	−7.35	+10.15	+10.85	+3.61
Uniform Hedge	+5.37	−0.45	−13.49	+7.43	+14.49	+2.67
Adaptive Hedge	+8.65	−13.10	−13.75	+8.67	+10.05	+0.10
Random	−3.41	−22.85	−16.90	+11.08	+3.23	−5.77

Where the alpha actually comes from

MSFT: B&H drew down −12% on this window; E26 stayed at −1%. The policy correctly de-sized before the MSFT decline. On Anthropic this contributes ~⅔ of the portfolio spread vs B&H; on OpenAI signals the same behaviour contributes 137% of the OpenAI spread (the other tickers are slight drags). The cut-MSFT rule is the most backbone-robust behaviour we observe.
JNJ: B&H +7.3% → E26 +13.8% by up-sizing into a clean uptrend on Anthropic signals. The JNJ up-size does not fully replicate on OpenAI (PPO matches B&H there).
NVDA: matches B&H on return (~+14%) but with a 12.5% drawdown vs B&H's 15.5%.
SPY / XOM: roughly tied with B&H — already strong long-trends, hard to add alpha.

Sharpe (annualised)

Policy	SPY	NVDA	MSFT	JNJ	XOM	Mean
PPO v3 E26	+1.93	+1.34	+0.05	+2.91	+1.79	+1.60
PPO v3 E20	+1.96	+1.38	−0.22	+2.43	+1.65	+1.44
PPO v3.1	+1.63	+0.25	−1.00	+4.28	+1.94	+1.42
Buy-and-hold	+1.59	+1.25	−0.98	+1.59	+1.63	+1.02

Max drawdown (%)

Policy	SPY	NVDA	MSFT	JNJ	XOM	Mean
PPO v3.1 (most conservative)	−3.14	−10.32	−10.32	−2.80	−5.70	−6.46
PPO v3 E26	−5.27	−12.51	−12.13	−5.93	−10.63	−9.29
PPO v3 E20	−5.21	−11.90	−16.54	−7.76	−12.91	−10.87
Buy-and-hold	−8.88	−15.54	−26.04	−10.96	−15.69	−15.42

Multi-LLM ablation

The Anthropic numbers above answer the headline question for one LLM backbone. The obvious reviewer challenge is "is the alpha from the policy or from one LLM's quirks?" We re-collected the full corpus using OpenAI's GPT-4o-mini under the same closed-day-only PIT contract, retrained PPO with the same hyperparameters, and re-ran the comparison.

Per-backbone result (PPO v3 E20)

Configuration	Total Return	Sharpe	Max DD
Anthropic-trained → Anthropic (headline)	+9.69%	+1.44	−10.87%
OpenAI-trained → OpenAI	+9.13%	+1.16	−12.40%
Anthropic-trained → OpenAI (cross-LLM transfer)	+10.61%	+1.62	−11.98%
Buy-and-hold (reference)	+7.23%	+1.02	−15.42%

The cross-LLM transfer is the strongest single result: the Anthropic-trained policy does better on the OpenAI corpus than the OpenAI-trained policy does on its own corpus. The policy isn't memorising backbone quirks; it generalises.

Honest finding: the production aggregator catches up on OpenAI

The Hedge / B&H numbers, by contrast, depend a lot on which LLM produced the signals:

Aggregator	Anthropic	OpenAI
TradingAgents adaptive Hedge	+0.10% / Sharpe 0.43	+9.30% / Sharpe 1.53
Uniform Hedge	+2.67% / Sharpe 0.59	+7.70% / Sharpe 1.16
PPO v3 E20	+9.69% / Sharpe 1.44	+9.13% / Sharpe 1.16

On Anthropic signals, PPO E20 clearly dominates the production Hedge (+9.59pp). On OpenAI signals, adaptive Hedge basically matches PPO (Sharpe-difference test: z = 0.05, one-sided p = 0.48 — statistically tied).

Two things survive the LLM swap:

PPO beats buy-and-hold by ~2pp on both backbones — the headline claim of this project.
The MSFT-save behaviour replicates (even more strongly on OpenAI; see the per-ticker section above).
Cross-LLM transfer works — train on one backbone, win on the other.

What does not survive the LLM swap is the PPO-clearly-beats-the-production-aggregator framing. That gap is Anthropic-specific.

A separate finding: daily-P&L correlation between the Anthropic-corpus PPO and the OpenAI-corpus PPO is only +0.09. The two policies trade quite differently but arrive at very similar aggregate returns — additional evidence the policy is learning something general rather than overfitting to one backbone's signal style.

Statistical significance (re-run with the corrected portfolio bootstrap)

Block bootstrap (5-day blocks, 2,000 resamples) on per-ticker returns, then cross-ticker mean. Diebold–Mariano (pooled daily-return differences, Newey–West HAC):

Metric	B&H	PPO E20 (Anth / OAI)	PPO E26 (Anth / OAI)
Point total return	+7.2 / +7.1 %	+9.7 / +8.9 %	+10.9 / +10.1 %
95% CI Sharpe	[−0.35, +2.46]	[+0.01, +3.03] / [−0.11, +2.59]	[+0.18, +3.21] / [+0.24, +3.11]
DM p one-sided vs B&H	—	0.245 / 0.254	0.222 / 0.112

The strongest evidence in the project is now PPO v3 E26 on OpenAI: Sharpe 95% CI strictly above zero, DM one-sided p = 0.112. Return-difference tests don't clear p<0.05 on either backbone — five tickers × 87 days is the binding constraint; expanding the ticker universe is the direct fix.

Honest caveats

Five tickers, four months, one regime. The 2026 YTD test window is 87 trading days, all of it a long-bull stretch. None of this generalises until it survives a bear-market window and a non-overlapping ticker universe. The multi-LLM ablation closed the which LLM question; which regime and which universe are still open.
PPO's edge over the production aggregator is LLM-dependent. Large on Anthropic (+9.59pp), statistically tied on OpenAI (Sharpe-diff p = 0.48). What survives the LLM swap is PPO beating B&H by ~2pp, not PPO beating the production EXP3 aggregator.
Transaction cost is modelled. The autoresearch winner E20 stays above B&H up to ~20 bps round-trip on both backbones; E26 up to ~50 bps. On OpenAI, adaptive Hedge is also cost- tolerant up to ~50 bps. The TX-cost grid is in experiments_v3/comparison_tx_grid.json and comparison_tx_grid_openai.json.
Statistical power. DM tests don't clear p<0.05 on either backbone. The Sharpe bootstrap CI is the cleanest evidence: strictly positive on both winners on both backbones, with the strongest single number being E26 on OpenAI (DM one-sided p = 0.112).
Test-on-test temptation. E26 has a stronger test result than E20, but is not the val-selected configuration. The principled headline number is E20 (+9.69%). E26 is shown for transparency as a sensitivity.
The reflection block was synthetic during training. At training time we use a teacher's rollout of the reflection state; at test time the live policy's actions drive it. We have not closed the train/test gap on the reflection distribution.
The autoresearch loop is greedy and one-dimensional. It changes one hyperparameter at a time and never reverses earlier decisions in light of later ones. The local optimum at (scale=15, hidden=512) may not be global.
Not investment advice. Research code, course project, no real money.

What was not changed from upstream

We deliberately did not modify the TradingAgents core, so this fork stays mergeable with upstream:

tradingagents/agents/ — analyst/researcher/risk/trader/portfolio prompts and graph
tradingagents/graph/ — LangGraph orchestration
tradingagents/dataflows/ — vendor data layer (yfinance, Alpha Vantage)
tradingagents/llm_clients/ — provider factory
cli/ — interactive Rich/Typer terminal UI
web/frontend/ — Next.js dashboard
web/backend/server.py and most of web/backend/ — FastAPI server

All CrowNest changes are additive, in web/backend/rl_policy/, experiments_v3/, and the scripts/{train_ppo*,run_v3_backtest,collect_corpus,featurize_corpus,compare_policies}.py entry points.

License

Apache-2.0 (inherited from upstream). See LICENSE.

Roadmap

See PLAN.md for the forward-looking extension plan. The biggest items already shipped are the transaction-cost grid, the statistical-significance battery, and the multi-LLM ablation (Anthropic + OpenAI). The two remaining items needed to move from course project to paper are bear-market windows (extend the corpus to 2020 for COVID and 2022 H1) and a larger ticker universe (~50 names stratified by sector).

The historical v1 → v3 progression that produced the current PPO stack is preserved at docs/PLAN_v1_to_v3.md.

Citation

If you use CrowNest, please cite via CITATION.cff and also cite the upstream TradingAgents paper:

@misc{xiao2025tradingagentsmultiagentsllmfinancial,
      title={TradingAgents: Multi-Agents LLM Financial Trading Framework},
      author={Yijia Xiao and Edward Sun and Di Luo and Wei Wang},
      year={2025},
      eprint={2412.20138},
      archivePrefix={arXiv},
      primaryClass={q-fin.TR},
      url={https://arxiv.org/abs/2412.20138}
}

For the autoresearch pattern itself: karpathy/autoresearch.

For the embedding model: sentence-transformers/all-MiniLM-L6-v2.

CrowNest — Columbia IEORE 4733 (Algorithmic Trading) — built on the shoulders of Tauric Research.

Name		Name	Last commit message	Last commit date
Latest commit History 196 Commits
assets		assets
cli		cli
dashboard		dashboard
docs		docs
experiments_v3		experiments_v3
models		models
scripts		scripts
tests		tests
tradingagents		tradingagents
web		web
.dockerignore		.dockerignore
.env.enterprise.example		.env.enterprise.example
.env.example		.env.example
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CrowNest_Final_Presentation.pdf		CrowNest_Final_Presentation.pdf
CrowNest_Final_Presentation.pptx		CrowNest_Final_Presentation.pptx
Dockerfile		Dockerfile
LICENSE		LICENSE
PLAN.md		PLAN.md
README.md		README.md
docker-compose.yml		docker-compose.yml
main.py		main.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

CrowNest

What is CrowNest?

Credit

Repo layout

Method (CrowNest piece)

State

Network architecture

Action

Reward

Training

PIT integrity

The autoresearch loop (experiments_v3/)

Reproducing the headline results

0. Set up

1. Build the corpus (gitignored — needs LLM API calls)

2. Featurize → state tensor

3. Run the autoresearch loop OR use the pre-trained checkpoints

4. Run the unified comparison

Per-ticker comparison (2026 YTD, 87 trading days)

Total return (%)

Where the alpha actually comes from

Sharpe (annualised)

Max drawdown (%)

Multi-LLM ablation

Per-backbone result (PPO v3 E20)

Honest finding: the production aggregator catches up on OpenAI

Statistical significance (re-run with the corrected portfolio bootstrap)

Honest caveats

What was not changed from upstream

License

Roadmap

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

The autoresearch loop (`experiments_v3/`)

Packages