Predict gene expression changes from CRISPRi perturbations in mouse bone marrow-derived macrophages (BMDMs).
Please checkout the website for full details!
This repository is my competition entry, built on top of the organizers' starter kit (upstream:
genentech/bioreasoningchallenge). The sections below the divider are the original starter-kit docs; everything inexamples/track_b_adversarial.py,examples/tools/, the benchmark harness, anddocs/track_b_architecture.mdis my own work.
Task. Given a (perturbation, gene) pair, predict the ternary effect of CRISPRi
knockdown on the target gene in mouse BMDMs (up / down / none). Scored as the mean
of two micro-AUROCs: DE (any effect vs none) and DIR (up vs down among DE-positive
rows). The train/test split is disjoint on both the perturbation and gene axes, so no
target gene's behavior can be memorized from training — the model must reason about an
unseen (pert, gene) interaction.
Because the metric is two independent AUROCs, I decomposed the prediction into two matched debates instead of one classifier. Per row, a moderator deterministically gathers a shared evidence dossier via tools, then runs:
- Debate 1 → DE: an EFFECT advocate vs a NULL advocate, scored by a judge →
P(DE) - Debate 2 → DIR: an UP advocate vs a DOWN advocate →
P(up | DE) prediction_up = P(DE)·P(up|DE),prediction_down = P(DE)·(1−P(up|DE))
Judges emit continuous calibrated probabilities (AUROC rewards ranking, not hard
labels). The key tool is pathway_neighbors — since exact lookups return nothing on the
disjoint split, it finds a perturbation's STRING network partners that do appear in
train and pools their knockdown label distribution (analogy, not memorization). See
docs/track_b_architecture.md for the full design.
Result: public leaderboard ≈ 0.569 on Track B.
- Track A (prompt-only, GPT-OSS-120B): derive continuous probabilities from the
softmax over the answer-token logprobs (
track_a_logprobs.py) rather than parsing a hard label, then average over 3 seeds — turning AUROC's appetite for ranking into a scoring lever. - Track C (fine-tuning): LoRA SFT of a <10B open model, served with vLLM at inference.
Rigor here is mostly about not chasing noise:
- DE detection is information-limited (~0.55), not prompt/effort/retrieval-limited. Sharpening the judge to suppress "famous-regulator" false positives was a real, isolated win; literature RAG, a learned feature combiner, and higher judge reasoning-effort all came back negative/noise in paired offline benchmarks.
- A gene regulatory network doesn't recover the indirect effects either. The natural
next idea — propagate a knockdown through a directed GRN instead of needing a documented
(pert, gene) pair — is the right representation but fails in practice. A free CPU
feature test (
examples/grn_feature_test.py: OmniPath signaling + CollecTRI, 70k edges) scored network reachability/proximity at DE AUROC 0.510 = chance on full train (equal to a pert-hubness control), and signed propagation gave DIR 0.549 with a 95% CI that crosses 0.50 on only 15% of rows. Two structural reasons: thin coverage (16% of pairs reachable) and reachability saturation (in a dense network almost everything connects, so topology can't discriminate without quantitative, macrophage-context edge weights — which is the STATE gap again). - Seed ensembling looked good offline (+0.012) but lost on the public LB (0.569 → 0.551). That drop is within the public-LB noise band (SE ≈ ±0.025), so it's a non-result — but the projected edge was below measurement resolution, so I reverted to the single best seed. The only lever with genuine headroom is a perturbation foundation model (STATE-class), which is a rules-gray-zone and out of scope here.
Every claim above is backed by a disk-cached, paired offline benchmark
(examples/benchmark_track_b*.py, run on a blinded train sample that simulates the
disjoint split) — I never burned a full leaderboard submission to test a hypothesis.
| Path | What |
|---|---|
examples/track_b_adversarial.py |
the adversarial-debate agent (DSPy-free, text tool-calling) |
examples/tools/ |
dossier tools: pathway_neighbors, gene_classify, base_rates, pubmed_search, rag_search |
examples/benchmark_track_b*.py |
blinded, paired, resumable offline benchmark harness |
examples/grn_feature_test.py |
GRN reachability/propagation feature test (the indirect-effect negative) |
examples/build_ensemble_submission.py |
seed-ensemble merger (stdlib only) |
examples/track_a_logprobs.py |
logprob-softmax Track A variant |
docs/track_b_architecture.md |
architecture write-up |
slurm/ |
Slurm job scripts (CSHL Elzar; paths/env are cluster-specific) |
outputs/track_*/ |
shipped submission bundles (.zip + prompt.txt); per-row caches are gitignored |
Participants are given (perturbation, gene) pairs and must predict a ternary effect on the target gene:
- up — upregulated
- down — downregulated
- none — not significantly affected
Ground-truth labels use a 5% FDR threshold and |shrunken log2FC| >= log2(1.5).
Submissions provide two probabilities per row: prediction_up and prediction_down. P(none) is implicitly 1 - prediction_up - prediction_down.
The competition is hosted on Kaggle with three separate tracks:
| Track | Name | Model | Key constraint |
|---|---|---|---|
| A | Prompt-only | GPT-OSS-120B (fixed) | Single prompt, 3 seeds, no tools |
| B | Agentic tool-use | GPT-OSS-120B (fixed) | Tools allowed, max 250 calls |
| C | Fine-tuning | Open model < 10B parameters | Any fine-tuning, no tools at inference |
git clone https://github.com/genentech/bioreasoningchallenge.git
cd bioreasoningchallenge
uv sync # core deps only (prompts, parsing, submission)This installs the mlgenx helper package, which provides prompt generation and answer parsing.
Track C has separate dependency groups for fine-tuning and serving (they require
incompatible transformers versions and cannot be installed together):
uv sync --extra train # fine-tuning: torch, transformers 5.x, trl, peft, …
uv sync --extra serve # serving: vllm (brings transformers 4.x)All competition data lives in data/:
| File | Description |
|---|---|
train.csv |
Training data with labels (id, pert, gene, label) — label is up, down, or none |
test.csv |
Test data without labels (id, pert, gene) |
sample_submission.csv |
Minimal submission template (id, prediction_up, prediction_down) |
sample_submission_track_a.csv |
Track A template with per-seed columns |
sample_submission_track_b.csv |
Track B template with tool-call columns |
sample_submission_track_c.csv |
Track C template with model-name column |
Row IDs are {perturbation}_{gene}, e.g. Aars_Actb or Stat1_Irf1.
See kaggle_data_description.md for full data documentation.
| Split | Perturbations | Rows | Labels (train) |
|---|---|---|---|
| Train | 386 | 7,705 | 2,359 up, 1,086 down, 4,260 none |
| Test (validation + test) | 96 | 1,813 | — |
Splits are disjoint along both the perturbation axis (80/10/10) and the gene axis (60/20/20). No gene appears in more than one split.
- Model: GPT-OSS-120B (fixed, no fine-tuning)
- Sampling:
temperature=1.0, top_p=1.0 - Format: Single prompt per question, max 4,096 prompt tokens
- Seeds: 3 samples per question (seeds 42, 43, 44); final prediction = average of
prediction_up/prediction_downacross seeds - Submission:
submission.csv+prompt.txtin a zip
- Model: GPT-OSS-120B (fixed, no fine-tuning)
- Sampling:
temperature=1.0, top_p=1.0 - Format: Prompt + tools + input question, max 4,096 prompt tokens
- Limits: Max 100 distinct tools, max 250 tool calls per question
- Submission:
submission.csv+tools/folder +prompt.txtin a zip
- Model: Open model < 10B parameters (e.g., Qwen3-4B-Thinking-2507), any fine-tuning allowed
- Format: Prompt + input question, max 16,000 new tokens at inference
- Allowed: SFT/LoRA, RL, process reward models, critic reranking, best-of-N
- Not allowed: Tools, web access, or external models during inference
- Submission:
submission.csv+prompt.txtin a zip
Tracks A and B use a fixed model that you serve locally via vLLM:
uv sync --extra serve
uv run --extra serve vllm serve openai/gpt-oss-120b \
--port 8000 \
--enforce-eager \
--no-enable-prefix-cachingThe model is ~120B parameters with mxfp4 quantization (~60 GB of weights).
Use --tensor-parallel-size <N> to shard across multiple GPUs if a single GPU
does not have enough memory. Two GPUs with ~80 GB each (e.g. A100-80G, H100,
B200) are sufficient with --tensor-parallel-size 2.
Important server flags:
--enforce-eager— Disables CUDA graph capture. Without this flag, GPT-OSS hits a known vLLM bug where the first 1--2 requests succeed but subsequent requests returncontent: nullwithfinish_reason: "length"despite tokens being generated server-side. The bug is triggered by CUDA graphs interacting with prefix caching and the attention-sink mechanism.
--no-enable-prefix-caching— Recommended by the vLLM GPT-OSS recipe for consistent behavior.
The first run downloads model weights from Hugging Face.
Set HF_HOME to a partition with at least 120 GB of free disk space before
starting the server. If the download is interrupted (e.g. disk full), the
cached snapshot may be left in an inconsistent state -- delete the partial cache
directory under $HF_HOME/hub/models--openai--gpt-oss-120b/ and retry.
GPT-OSS-120B is a reasoning model. Use max_completion_tokens (not the
deprecated max_tokens) in your API requests to set the output budget for
reasoning + visible answer combined. Set reasoning_effort to control how
much the model reasons before answering:
reasoning_effort |
Behavior | Typical tokens |
|---|---|---|
"low" |
Brief reasoning, fast responses | 30--100 |
"medium" |
Moderate reasoning | 200--2,000 |
"high" |
Extended reasoning, highest quality | 1,000--10,000+ |
Key parameter: max_completion_tokens vs max_tokens. For reasoning
models, max_completion_tokens correctly budgets reasoning and visible output
together. Using the legacy max_tokens parameter causes the model to consume
the entire budget on reasoning without producing a visible answer.
The API response separates reasoning from the final answer:
{
"choices": [{
"message": {
"reasoning": "... internal chain-of-thought ...",
"content": "... final answer ..."
},
"finish_reason": "stop"
}]
}When the model runs out of tokens during reasoning, both reasoning and
content will be null.
Calls the LLM with 3 seeds (42, 43, 44), averages the predictions, and packages a zip.
Use --concurrency N to send multiple requests in parallel for faster runs.
# Default: uses mlgenx built-in prompts
uv run python examples/track_a_prompt_only.py --api-base http://localhost:8000/v1
# Parallel requests (much faster)
uv run python examples/track_a_prompt_only.py --api-base http://localhost:8000/v1 --concurrency 20
# Use a custom prompt template (placeholders: {pert}, {gene}, {cell_desc})
uv run python examples/track_a_prompt_only.py --prompt-template examples/prompt_template.txt ...
# Use a CSV/JSONL of pre-written per-row prompts (columns: id, prompt)
uv run python examples/track_a_prompt_only.py --prompts-csv examples/example_prompts.csv ...See examples/prompt_template.txt and examples/example_prompts.csv for input format examples.
Runs an agentic loop where the LLM can call tools between reasoning steps.
uv run python examples/track_b_agentic.py --api-base http://localhost:8000/v1Three example tools are provided in examples/tools/:
| Tool | Source | Description |
|---|---|---|
train_data_lookup |
Local train.csv |
Look up known labels for a perturbation or gene |
gene_info |
mygene.info API | Retrieve gene annotations (summary, GO terms, pathways) |
protein_interactions |
STRING DB API | Query protein-protein interaction partners |
A metric-aligned debate agent: a moderator gathers a shared evidence dossier, then runs two
adversarial sub-debates (EFFECT vs NULL → P(DE); UP vs DOWN → P(up|DE)) whose judges emit
calibrated probabilities combined as prediction_up = P(DE)·P(up|DE). See the architecture
write-up (Markdown ·
rendered HTML) for the full diagram and design rationale.
Validate configs offline first with examples/benchmark_track_b.py.
A multi-agent version of Track B where a coordinator agent delegates to specialist sub-agents, each backed by the same LLM via DSPy ReAct:
biology_expert— sub-agent withgene_infoandprotein_interactionstoolsdata_analyst— sub-agent withlookup_pertandlookup_genetools
The coordinator consults one or both specialists, synthesizes their findings, and calls
submit_answer. All traces are captured hierarchically:
{"coordinator": {...}, "sub_agents": [...]}. Token and tool-call counts aggregate
across all agents.
uv run python examples/track_b_multiagent.py --api-base http://localhost:8000/v1
# Tune iteration budgets
uv run python examples/track_b_multiagent.py \
--api-base http://localhost:8000/v1 \
--max-iters 20 --max-sub-iters 5Track C is a two-step workflow. Fine-tuning and serving require different
dependency sets (train vs serve extras) because trl needs
transformers>=5.3 while vLLM requires transformers<5. Switch between them
by re-running uv sync with the appropriate extra.
Step 1: Fine-tune (run once, needs a GPU)
uv sync --extra train
uv run --extra train python examples/finetune.py \
--train-csv data/train.csv \
--model-id Qwen/Qwen3-4B-Thinking-2507 \
--output-dir outputs/finetuned_model \
--epochs 3 --lr 2e-4 --lora-rank 16This produces a merged LoRA model in outputs/finetuned_model/.
Step 1b: Patch tokenizer (one-time fix after fine-tuning)
The train extra uses transformers>=5.3, which saves extra_special_tokens
in a format incompatible with the transformers 4.x bundled by vLLM. Run this
once after fine-tuning to fix the tokenizer config:
python -c "
import json; from pathlib import Path
p = Path('outputs/finetuned_model/tokenizer_config.json')
cfg = json.loads(p.read_text())
est = cfg.get('extra_special_tokens')
if isinstance(est, list):
cfg['extra_special_tokens'] = {t: t for t in est} if est else {}
p.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
print(f'Fixed: converted list of {len(est)} tokens to dict')
else:
print('No fix needed')
"Step 2: Serve and run inference (needs a GPU)
uv sync --extra serve
# Serve with vLLM
uv run --extra serve vllm serve outputs/finetuned_model --port 8000
# In another terminal -- generate predictions
uv run --extra serve python examples/track_c_finetune.py \
--api-base http://localhost:8000/v1 \
--model outputs/finetuned_model \
--base-model Qwen/Qwen3-4B-Thinking-2507Use the example scripts above or write your own. Each script outputs a zip file ready for Kaggle upload.
Each track requires specific columns in submission.csv:
Track A columns: id, prediction_up, prediction_down, prediction_up_seed42, prediction_down_seed42, prediction_up_seed43, prediction_down_seed43, prediction_up_seed44, prediction_down_seed44, reasoning_trace_seed42, reasoning_trace_seed43, reasoning_trace_seed44, tokens_used, model_name
Track B columns: id, prediction_up, prediction_down, reasoning_trace, tokens_used, num_tool_calls, prompt_tokens, num_distinct_tools, model_name
Track C columns: id, prediction_up, prediction_down, reasoning_trace, tokens_used, model_name
The id column must match every row in test.csv exactly. Only id, prediction_up, and prediction_down are used for scoring; all other columns are required metadata. Submissions missing required metadata columns will receive a score of 0.
No null values allowed. Every cell must be filled. For rows where the model
returned an empty response, use "none" for reasoning traces and 0 for token
counts. The example scripts handle this automatically.
# Track A zip contents:
submission.csv
prompt.txt
# Track B zip contents:
submission.csv
prompt.txt
tools/*.py
# Track C zip contents:
submission.csv
prompt.txt
Go to the competition page on Kaggle and upload your zip file.
The competition metric is the average of two micro AUROCs computed from the ternary labels:
- DE AUROC: (up + down) vs none, using score
prediction_up + prediction_down. - DIR AUROC: up vs down among DE-positive rows, using score
prediction_up / (prediction_up + prediction_down)(conditional probability of up given DE).
score = (DE_AUROC + DIR_AUROC) / 2
- Random baseline (reasonable spread across classes): near chance on both components
- Perfect model: 1.0
Submissions that omit required metadata columns (reasoning traces, token counts, etc.) will score 0.0.
from mlgenx import format_prompt, parse_answer, build_submission
# Generate a prompt
prompt = format_prompt("Aars", "Actb")
# ... send to LLM, get response_text ...
# Parse the response
prediction_up, prediction_down = parse_answer(response_text)
# Build a submission
df = build_submission(ids, predictions_up, predictions_down, output_path="submission.csv")from mlgenx import format_prompts_from_csv
prompts_df = format_prompts_from_csv("data/test.csv")
# DataFrame with columns: id, promptprompt = format_prompt("Aars", "Actb", examples=[
{"pert": "Brca1", "gene": "Tp53", "label": "none"},
{"pert": "Myc", "gene": "Cdkn1a", "label": "up"},
])| Function | Description |
|---|---|
format_prompt(pert, gene, examples=None) |
Generate a single LLM prompt (zero-shot or few-shot) |
format_prompts_from_csv(csv_path, examples=None) |
Generate prompts for all rows in a CSV |
parse_answer(text, default=(0.333, 0.333)) |
Parse one LLM response into (prediction_up, prediction_down) |
parse_answers(texts, default=(0.333, 0.333)) |
Parse a list of LLM responses |
build_submission(ids, predictions_up, predictions_down, output_path=None) |
Assemble a submission DataFrame/CSV |
- Data format inspired by PerturbQA (Wu et al., ICLR 2025)
- Source data: CRISPRi Perturb-seq in mouse BMDMs

