BioReasoning Challenge -- MLGenX LLM Perturbation Competition

Predict gene expression changes from CRISPRi perturbations in mouse bone marrow-derived macrophages (BMDMs).

Website

Please checkout the website for full details!

My submission (Jiawei Xing)

This repository is my competition entry, built on top of the organizers' starter kit (upstream: genentech/bioreasoningchallenge). The sections below the divider are the original starter-kit docs; everything in examples/track_b_adversarial.py, examples/tools/, the benchmark harness, and docs/track_b_architecture.md is my own work.

Task. Given a (perturbation, gene) pair, predict the ternary effect of CRISPRi knockdown on the target gene in mouse BMDMs (up / down / none). Scored as the mean of two micro-AUROCs: DE (any effect vs none) and DIR (up vs down among DE-positive rows). The train/test split is disjoint on both the perturbation and gene axes, so no target gene's behavior can be memorized from training — the model must reason about an unseen (pert, gene) interaction.

Track B — adversarial-debate agent (the centerpiece)

Because the metric is two independent AUROCs, I decomposed the prediction into two matched debates instead of one classifier. Per row, a moderator deterministically gathers a shared evidence dossier via tools, then runs:

Debate 1 → DE: an EFFECT advocate vs a NULL advocate, scored by a judge → P(DE)
Debate 2 → DIR: an UP advocate vs a DOWN advocate → P(up | DE)
prediction_up = P(DE)·P(up|DE), prediction_down = P(DE)·(1−P(up|DE))

Judges emit continuous calibrated probabilities (AUROC rewards ranking, not hard labels). The key tool is pathway_neighbors — since exact lookups return nothing on the disjoint split, it finds a perturbation's STRING network partners that do appear in train and pools their knockdown label distribution (analogy, not memorization). See docs/track_b_architecture.md for the full design.

Result: public leaderboard ≈ 0.569 on Track B.

Tracks A & C

Track A (prompt-only, GPT-OSS-120B): derive continuous probabilities from the softmax over the answer-token logprobs (track_a_logprobs.py) rather than parsing a hard label, then average over 3 seeds — turning AUROC's appetite for ranking into a scoring lever.
Track C (fine-tuning): LoRA SFT of a <10B open model, served with vLLM at inference.

What I learned (incl. honest negatives)

Rigor here is mostly about not chasing noise:

DE detection is information-limited (~0.55), not prompt/effort/retrieval-limited. Sharpening the judge to suppress "famous-regulator" false positives was a real, isolated win; literature RAG, a learned feature combiner, and higher judge reasoning-effort all came back negative/noise in paired offline benchmarks.
A gene regulatory network doesn't recover the indirect effects either. The natural next idea — propagate a knockdown through a directed GRN instead of needing a documented (pert, gene) pair — is the right representation but fails in practice. A free CPU feature test (examples/grn_feature_test.py: OmniPath signaling + CollecTRI, 70k edges) scored network reachability/proximity at DE AUROC 0.510 = chance on full train (equal to a pert-hubness control), and signed propagation gave DIR 0.549 with a 95% CI that crosses 0.50 on only 15% of rows. Two structural reasons: thin coverage (16% of pairs reachable) and reachability saturation (in a dense network almost everything connects, so topology can't discriminate without quantitative, macrophage-context edge weights — which is the STATE gap again).
Seed ensembling looked good offline (+0.012) but lost on the public LB (0.569 → 0.551). That drop is within the public-LB noise band (SE ≈ ±0.025), so it's a non-result — but the projected edge was below measurement resolution, so I reverted to the single best seed. The only lever with genuine headroom is a perturbation foundation model (STATE-class), which is a rules-gray-zone and out of scope here.

Every claim above is backed by a disk-cached, paired offline benchmark (examples/benchmark_track_b*.py, run on a blinded train sample that simulates the disjoint split) — I never burned a full leaderboard submission to test a hypothesis.

Repo layout (my additions)

Path	What
`examples/track_b_adversarial.py`	the adversarial-debate agent (DSPy-free, text tool-calling)
`examples/tools/`	dossier tools: `pathway_neighbors`, `gene_classify`, `base_rates`, `pubmed_search`, `rag_search`
`examples/benchmark_track_b*.py`	blinded, paired, resumable offline benchmark harness
`examples/grn_feature_test.py`	GRN reachability/propagation feature test (the indirect-effect negative)
`examples/build_ensemble_submission.py`	seed-ensemble merger (stdlib only)
`examples/track_a_logprobs.py`	logprob-softmax Track A variant
`docs/track_b_architecture.md`	architecture write-up
`slurm/`	Slurm job scripts (CSHL Elzar; paths/env are cluster-specific)
`outputs/track_*/`	shipped submission bundles (`.zip` + `prompt.txt`); per-row caches are gitignored

Overview

Participants are given (perturbation, gene) pairs and must predict a ternary effect on the target gene:

up — upregulated
down — downregulated
none — not significantly affected

Ground-truth labels use a 5% FDR threshold and |shrunken log2FC| >= log2(1.5).

Submissions provide two probabilities per row: prediction_up and prediction_down. P(none) is implicitly 1 - prediction_up - prediction_down.

The competition is hosted on Kaggle with three separate tracks:

Track	Name	Model	Key constraint
A	Prompt-only	GPT-OSS-120B (fixed)	Single prompt, 3 seeds, no tools
B	Agentic tool-use	GPT-OSS-120B (fixed)	Tools allowed, max 250 calls
C	Fine-tuning	Open model < 10B parameters	Any fine-tuning, no tools at inference

Installation

git clone https://github.com/genentech/bioreasoningchallenge.git
cd bioreasoningchallenge
uv sync            # core deps only (prompts, parsing, submission)

This installs the mlgenx helper package, which provides prompt generation and answer parsing.

Track C has separate dependency groups for fine-tuning and serving (they require incompatible transformers versions and cannot be installed together):

uv sync --extra train   # fine-tuning: torch, transformers 5.x, trl, peft, …
uv sync --extra serve   # serving:     vllm (brings transformers 4.x)

Data

All competition data lives in data/:

File	Description
`train.csv`	Training data with labels (`id, pert, gene, label`) — label is `up`, `down`, or `none`
`test.csv`	Test data without labels (`id, pert, gene`)
`sample_submission.csv`	Minimal submission template (`id, prediction_up, prediction_down`)
`sample_submission_track_a.csv`	Track A template with per-seed columns
`sample_submission_track_b.csv`	Track B template with tool-call columns
`sample_submission_track_c.csv`	Track C template with model-name column

Row IDs are {perturbation}_{gene}, e.g. Aars_Actb or Stat1_Irf1.

See kaggle_data_description.md for full data documentation.

Dataset size

Split	Perturbations	Rows	Labels (train)
Train	386	7,705	2,359 up, 1,086 down, 4,260 none
Test (validation + test)	96	1,813	—

Splits are disjoint along both the perturbation axis (80/10/10) and the gene axis (60/20/20). No gene appears in more than one split.

Tracks

Track A -- Prompt-only

Model: GPT-OSS-120B (fixed, no fine-tuning)
Sampling: temperature=1.0, top_p=1.0
Format: Single prompt per question, max 4,096 prompt tokens
Seeds: 3 samples per question (seeds 42, 43, 44); final prediction = average of prediction_up / prediction_down across seeds
Submission: submission.csv + prompt.txt in a zip

Track B -- Agentic tool-use

Model: GPT-OSS-120B (fixed, no fine-tuning)
Sampling: temperature=1.0, top_p=1.0
Format: Prompt + tools + input question, max 4,096 prompt tokens
Limits: Max 100 distinct tools, max 250 tool calls per question
Submission: submission.csv + tools/ folder + prompt.txt in a zip

Track C -- Fine-tuning

Model: Open model < 10B parameters (e.g., Qwen3-4B-Thinking-2507), any fine-tuning allowed
Format: Prompt + input question, max 16,000 new tokens at inference
Allowed: SFT/LoRA, RL, process reward models, critic reranking, best-of-N
Not allowed: Tools, web access, or external models during inference
Submission: submission.csv + prompt.txt in a zip

Serving GPT-OSS-120B (Tracks A & B)

Tracks A and B use a fixed model that you serve locally via vLLM:

uv sync --extra serve

uv run --extra serve vllm serve openai/gpt-oss-120b \
    --port 8000 \
    --enforce-eager \
    --no-enable-prefix-caching

The model is ~120B parameters with mxfp4 quantization (~60 GB of weights). Use --tensor-parallel-size <N> to shard across multiple GPUs if a single GPU does not have enough memory. Two GPUs with ~80 GB each (e.g. A100-80G, H100, B200) are sufficient with --tensor-parallel-size 2.

Important server flags:

--enforce-eager — Disables CUDA graph capture. Without this flag, GPT-OSS hits a known vLLM bug where the first 1--2 requests succeed but subsequent requests return content: null with finish_reason: "length" despite tokens being generated server-side. The bug is triggered by CUDA graphs interacting with prefix caching and the attention-sink mechanism.

--no-enable-prefix-caching — Recommended by the vLLM GPT-OSS recipe for consistent behavior.

The first run downloads model weights from Hugging Face. Set HF_HOME to a partition with at least 120 GB of free disk space before starting the server. If the download is interrupted (e.g. disk full), the cached snapshot may be left in an inconsistent state -- delete the partial cache directory under $HF_HOME/hub/models--openai--gpt-oss-120b/ and retry.

Reasoning model behavior

GPT-OSS-120B is a reasoning model. Use max_completion_tokens (not the deprecated max_tokens) in your API requests to set the output budget for reasoning + visible answer combined. Set reasoning_effort to control how much the model reasons before answering:

`reasoning_effort`	Behavior	Typical tokens
`"low"`	Brief reasoning, fast responses	30--100
`"medium"`	Moderate reasoning	200--2,000
`"high"`	Extended reasoning, highest quality	1,000--10,000+

Key parameter: max_completion_tokens vs max_tokens. For reasoning models, max_completion_tokens correctly budgets reasoning and visible output together. Using the legacy max_tokens parameter causes the model to consume the entire budget on reasoning without producing a visible answer.

The API response separates reasoning from the final answer:

{
  "choices": [{
    "message": {
      "reasoning": "... internal chain-of-thought ...",
      "content": "... final answer ..."
    },
    "finish_reason": "stop"
  }]
}

When the model runs out of tokens during reasoning, both reasoning and content will be null.

Example Scripts

Track A -- `examples/track_a_prompt_only.py`

Calls the LLM with 3 seeds (42, 43, 44), averages the predictions, and packages a zip. Use --concurrency N to send multiple requests in parallel for faster runs.

# Default: uses mlgenx built-in prompts
uv run python examples/track_a_prompt_only.py --api-base http://localhost:8000/v1

# Parallel requests (much faster)
uv run python examples/track_a_prompt_only.py --api-base http://localhost:8000/v1 --concurrency 20

# Use a custom prompt template (placeholders: {pert}, {gene}, {cell_desc})
uv run python examples/track_a_prompt_only.py --prompt-template examples/prompt_template.txt ...

# Use a CSV/JSONL of pre-written per-row prompts (columns: id, prompt)
uv run python examples/track_a_prompt_only.py --prompts-csv examples/example_prompts.csv ...

See examples/prompt_template.txt and examples/example_prompts.csv for input format examples.

Track B -- `examples/track_b_agentic.py`

Runs an agentic loop where the LLM can call tools between reasoning steps.

uv run python examples/track_b_agentic.py --api-base http://localhost:8000/v1

Three example tools are provided in examples/tools/:

Tool	Source	Description
`train_data_lookup`	Local `train.csv`	Look up known labels for a perturbation or gene
`gene_info`	mygene.info API	Retrieve gene annotations (summary, GO terms, pathways)
`protein_interactions`	STRING DB API	Query protein-protein interaction partners

Track B (adversarial-debate variant) -- `examples/track_b_adversarial.py`

A metric-aligned debate agent: a moderator gathers a shared evidence dossier, then runs two adversarial sub-debates (EFFECT vs NULL → P(DE); UP vs DOWN → P(up|DE)) whose judges emit calibrated probabilities combined as prediction_up = P(DE)·P(up|DE). See the architecture write-up (Markdown · rendered HTML) for the full diagram and design rationale. Validate configs offline first with examples/benchmark_track_b.py.

Track B (multi-agent variant) -- `examples/track_b_multiagent.py`

A multi-agent version of Track B where a coordinator agent delegates to specialist sub-agents, each backed by the same LLM via DSPy ReAct:

biology_expert — sub-agent with gene_info and protein_interactions tools
data_analyst — sub-agent with lookup_pert and lookup_gene tools

The coordinator consults one or both specialists, synthesizes their findings, and calls submit_answer. All traces are captured hierarchically: {"coordinator": {...}, "sub_agents": [...]}. Token and tool-call counts aggregate across all agents.

uv run python examples/track_b_multiagent.py --api-base http://localhost:8000/v1

# Tune iteration budgets
uv run python examples/track_b_multiagent.py \
    --api-base http://localhost:8000/v1 \
    --max-iters 20 --max-sub-iters 5

Track C -- `examples/finetune.py` + `examples/track_c_finetune.py`

Track C is a two-step workflow. Fine-tuning and serving require different dependency sets (train vs serve extras) because trl needs transformers>=5.3 while vLLM requires transformers<5. Switch between them by re-running uv sync with the appropriate extra.

Step 1: Fine-tune (run once, needs a GPU)

uv sync --extra train

uv run --extra train python examples/finetune.py \
    --train-csv data/train.csv \
    --model-id Qwen/Qwen3-4B-Thinking-2507 \
    --output-dir outputs/finetuned_model \
    --epochs 3 --lr 2e-4 --lora-rank 16

This produces a merged LoRA model in outputs/finetuned_model/.

Step 1b: Patch tokenizer (one-time fix after fine-tuning)

The train extra uses transformers>=5.3, which saves extra_special_tokens in a format incompatible with the transformers 4.x bundled by vLLM. Run this once after fine-tuning to fix the tokenizer config:

python -c "
import json; from pathlib import Path
p = Path('outputs/finetuned_model/tokenizer_config.json')
cfg = json.loads(p.read_text())
est = cfg.get('extra_special_tokens')
if isinstance(est, list):
    cfg['extra_special_tokens'] = {t: t for t in est} if est else {}
    p.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
    print(f'Fixed: converted list of {len(est)} tokens to dict')
else:
    print('No fix needed')
"

Step 2: Serve and run inference (needs a GPU)

uv sync --extra serve

# Serve with vLLM
uv run --extra serve vllm serve outputs/finetuned_model --port 8000

# In another terminal -- generate predictions
uv run --extra serve python examples/track_c_finetune.py \
    --api-base http://localhost:8000/v1 \
    --model outputs/finetuned_model \
    --base-model Qwen/Qwen3-4B-Thinking-2507

How to Submit

Step 1: Generate predictions

Use the example scripts above or write your own. Each script outputs a zip file ready for Kaggle upload.

Step 2: Verify your submission

Each track requires specific columns in submission.csv:

Track A columns: id, prediction_up, prediction_down, prediction_up_seed42, prediction_down_seed42, prediction_up_seed43, prediction_down_seed43, prediction_up_seed44, prediction_down_seed44, reasoning_trace_seed42, reasoning_trace_seed43, reasoning_trace_seed44, tokens_used, model_name

Track B columns: id, prediction_up, prediction_down, reasoning_trace, tokens_used, num_tool_calls, prompt_tokens, num_distinct_tools, model_name

Track C columns: id, prediction_up, prediction_down, reasoning_trace, tokens_used, model_name

The id column must match every row in test.csv exactly. Only id, prediction_up, and prediction_down are used for scoring; all other columns are required metadata. Submissions missing required metadata columns will receive a score of 0.

No null values allowed. Every cell must be filled. For rows where the model returned an empty response, use "none" for reasoning traces and 0 for token counts. The example scripts handle this automatically.

Step 3: Package into a zip

# Track A zip contents:
submission.csv
prompt.txt

# Track B zip contents:
submission.csv
prompt.txt
tools/*.py

# Track C zip contents:
submission.csv
prompt.txt

Step 4: Upload to Kaggle

Go to the competition page on Kaggle and upload your zip file.

Evaluation

The competition metric is the average of two micro AUROCs computed from the ternary labels:

DE AUROC: (up + down) vs none, using score prediction_up + prediction_down.
DIR AUROC: up vs down among DE-positive rows, using score prediction_up / (prediction_up + prediction_down) (conditional probability of up given DE).

score = (DE_AUROC + DIR_AUROC) / 2

Random baseline (reasonable spread across classes): near chance on both components
Perfect model: 1.0

Submissions that omit required metadata columns (reasoning traces, token counts, etc.) will score 0.0.

Quick Start

from mlgenx import format_prompt, parse_answer, build_submission

# Generate a prompt
prompt = format_prompt("Aars", "Actb")

# ... send to LLM, get response_text ...

# Parse the response
prediction_up, prediction_down = parse_answer(response_text)

# Build a submission
df = build_submission(ids, predictions_up, predictions_down, output_path="submission.csv")

Batch prompt generation

from mlgenx import format_prompts_from_csv

prompts_df = format_prompts_from_csv("data/test.csv")
# DataFrame with columns: id, prompt

Few-shot prompting

prompt = format_prompt("Aars", "Actb", examples=[
    {"pert": "Brca1", "gene": "Tp53", "label": "none"},
    {"pert": "Myc", "gene": "Cdkn1a", "label": "up"},
])

API Reference

Function	Description
`format_prompt(pert, gene, examples=None)`	Generate a single LLM prompt (zero-shot or few-shot)
`format_prompts_from_csv(csv_path, examples=None)`	Generate prompts for all rows in a CSV
`parse_answer(text, default=(0.333, 0.333))`	Parse one LLM response into `(prediction_up, prediction_down)`
`parse_answers(texts, default=(0.333, 0.333))`	Parse a list of LLM responses
`build_submission(ids, predictions_up, predictions_down, output_path=None)`	Assemble a submission DataFrame/CSV

References

Data format inspired by PerturbQA (Wu et al., ICLR 2025)
Source data: CRISPRi Perturb-seq in mouse BMDMs

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
data		data
docs		docs
examples		examples
mlgenx		mlgenx
outputs		outputs
slurm		slurm
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
kaggle_data_description.md		kaggle_data_description.md
kaggle_metric.py		kaggle_metric.py
kaggle_metric_track_a.py		kaggle_metric_track_a.py
kaggle_metric_track_b.py		kaggle_metric_track_b.py
kaggle_metric_track_c.py		kaggle_metric_track_c.py
pyproject.toml		pyproject.toml
serve_with_logprobs_fix.py		serve_with_logprobs_fix.py

Folders and files

Latest commit

History

Repository files navigation

BioReasoning Challenge -- MLGenX LLM Perturbation Competition

Website

My submission (Jiawei Xing)

Track B — adversarial-debate agent (the centerpiece)

Tracks A & C

What I learned (incl. honest negatives)

Repo layout (my additions)

Overview

Installation

Data

Dataset size

Tracks

Track A -- Prompt-only

Track B -- Agentic tool-use

Track C -- Fine-tuning

Serving GPT-OSS-120B (Tracks A & B)

Reasoning model behavior

Example Scripts

Track A -- examples/track_a_prompt_only.py

Track B -- examples/track_b_agentic.py

Track B (adversarial-debate variant) -- examples/track_b_adversarial.py

Track B (multi-agent variant) -- examples/track_b_multiagent.py

Track C -- examples/finetune.py + examples/track_c_finetune.py

How to Submit

Step 1: Generate predictions

Step 2: Verify your submission

Step 3: Package into a zip

Step 4: Upload to Kaggle

Evaluation

Quick Start

Batch prompt generation

Few-shot prompting

API Reference

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Track A -- `examples/track_a_prompt_only.py`

Track B -- `examples/track_b_agentic.py`

Track B (adversarial-debate variant) -- `examples/track_b_adversarial.py`

Track B (multi-agent variant) -- `examples/track_b_multiagent.py`

Track C -- `examples/finetune.py` + `examples/track_c_finetune.py`

Packages