Skip to content

Jiawei-Xing/PerturbAgent

Repository files navigation

BioReasoning Challenge -- MLGenX LLM Perturbation Competition

BioReasoning Challenge overview

Predict gene expression changes from CRISPRi perturbations in mouse bone marrow-derived macrophages (BMDMs).

Website

Please checkout the website for full details!


My submission (Jiawei Xing)

This repository is my competition entry, built on top of the organizers' starter kit (upstream: genentech/bioreasoningchallenge). The sections below the divider are the original starter-kit docs; everything in examples/track_b_adversarial.py, examples/tools/, the benchmark harness, and docs/track_b_architecture.md is my own work.

Task. Given a (perturbation, gene) pair, predict the ternary effect of CRISPRi knockdown on the target gene in mouse BMDMs (up / down / none). Scored as the mean of two micro-AUROCs: DE (any effect vs none) and DIR (up vs down among DE-positive rows). The train/test split is disjoint on both the perturbation and gene axes, so no target gene's behavior can be memorized from training — the model must reason about an unseen (pert, gene) interaction.

Track B — adversarial-debate agent (the centerpiece)

Because the metric is two independent AUROCs, I decomposed the prediction into two matched debates instead of one classifier. Per row, a moderator deterministically gathers a shared evidence dossier via tools, then runs:

  • Debate 1 → DE: an EFFECT advocate vs a NULL advocate, scored by a judge → P(DE)
  • Debate 2 → DIR: an UP advocate vs a DOWN advocate → P(up | DE)
  • prediction_up = P(DE)·P(up|DE), prediction_down = P(DE)·(1−P(up|DE))

Judges emit continuous calibrated probabilities (AUROC rewards ranking, not hard labels). The key tool is pathway_neighbors — since exact lookups return nothing on the disjoint split, it finds a perturbation's STRING network partners that do appear in train and pools their knockdown label distribution (analogy, not memorization). See docs/track_b_architecture.md for the full design.

Track B adversarial-debate agent architecture

Result: public leaderboard ≈ 0.569 on Track B.

Tracks A & C

  • Track A (prompt-only, GPT-OSS-120B): derive continuous probabilities from the softmax over the answer-token logprobs (track_a_logprobs.py) rather than parsing a hard label, then average over 3 seeds — turning AUROC's appetite for ranking into a scoring lever.
  • Track C (fine-tuning): LoRA SFT of a <10B open model, served with vLLM at inference.

What I learned (incl. honest negatives)

Rigor here is mostly about not chasing noise:

  • DE detection is information-limited (~0.55), not prompt/effort/retrieval-limited. Sharpening the judge to suppress "famous-regulator" false positives was a real, isolated win; literature RAG, a learned feature combiner, and higher judge reasoning-effort all came back negative/noise in paired offline benchmarks.
  • A gene regulatory network doesn't recover the indirect effects either. The natural next idea — propagate a knockdown through a directed GRN instead of needing a documented (pert, gene) pair — is the right representation but fails in practice. A free CPU feature test (examples/grn_feature_test.py: OmniPath signaling + CollecTRI, 70k edges) scored network reachability/proximity at DE AUROC 0.510 = chance on full train (equal to a pert-hubness control), and signed propagation gave DIR 0.549 with a 95% CI that crosses 0.50 on only 15% of rows. Two structural reasons: thin coverage (16% of pairs reachable) and reachability saturation (in a dense network almost everything connects, so topology can't discriminate without quantitative, macrophage-context edge weights — which is the STATE gap again).
  • Seed ensembling looked good offline (+0.012) but lost on the public LB (0.569 → 0.551). That drop is within the public-LB noise band (SE ≈ ±0.025), so it's a non-result — but the projected edge was below measurement resolution, so I reverted to the single best seed. The only lever with genuine headroom is a perturbation foundation model (STATE-class), which is a rules-gray-zone and out of scope here.

Every claim above is backed by a disk-cached, paired offline benchmark (examples/benchmark_track_b*.py, run on a blinded train sample that simulates the disjoint split) — I never burned a full leaderboard submission to test a hypothesis.

Repo layout (my additions)

Path What
examples/track_b_adversarial.py the adversarial-debate agent (DSPy-free, text tool-calling)
examples/tools/ dossier tools: pathway_neighbors, gene_classify, base_rates, pubmed_search, rag_search
examples/benchmark_track_b*.py blinded, paired, resumable offline benchmark harness
examples/grn_feature_test.py GRN reachability/propagation feature test (the indirect-effect negative)
examples/build_ensemble_submission.py seed-ensemble merger (stdlib only)
examples/track_a_logprobs.py logprob-softmax Track A variant
docs/track_b_architecture.md architecture write-up
slurm/ Slurm job scripts (CSHL Elzar; paths/env are cluster-specific)
outputs/track_*/ shipped submission bundles (.zip + prompt.txt); per-row caches are gitignored

Overview

Participants are given (perturbation, gene) pairs and must predict a ternary effect on the target gene:

  • up — upregulated
  • down — downregulated
  • none — not significantly affected

Ground-truth labels use a 5% FDR threshold and |shrunken log2FC| >= log2(1.5).

Submissions provide two probabilities per row: prediction_up and prediction_down. P(none) is implicitly 1 - prediction_up - prediction_down.

The competition is hosted on Kaggle with three separate tracks:

Track Name Model Key constraint
A Prompt-only GPT-OSS-120B (fixed) Single prompt, 3 seeds, no tools
B Agentic tool-use GPT-OSS-120B (fixed) Tools allowed, max 250 calls
C Fine-tuning Open model < 10B parameters Any fine-tuning, no tools at inference

Installation

git clone https://github.com/genentech/bioreasoningchallenge.git
cd bioreasoningchallenge
uv sync            # core deps only (prompts, parsing, submission)

This installs the mlgenx helper package, which provides prompt generation and answer parsing.

Track C has separate dependency groups for fine-tuning and serving (they require incompatible transformers versions and cannot be installed together):

uv sync --extra train   # fine-tuning: torch, transformers 5.x, trl, peft, …
uv sync --extra serve   # serving:     vllm (brings transformers 4.x)

Data

All competition data lives in data/:

File Description
train.csv Training data with labels (id, pert, gene, label) — label is up, down, or none
test.csv Test data without labels (id, pert, gene)
sample_submission.csv Minimal submission template (id, prediction_up, prediction_down)
sample_submission_track_a.csv Track A template with per-seed columns
sample_submission_track_b.csv Track B template with tool-call columns
sample_submission_track_c.csv Track C template with model-name column

Row IDs are {perturbation}_{gene}, e.g. Aars_Actb or Stat1_Irf1.

See kaggle_data_description.md for full data documentation.

Dataset size

Split Perturbations Rows Labels (train)
Train 386 7,705 2,359 up, 1,086 down, 4,260 none
Test (validation + test) 96 1,813

Splits are disjoint along both the perturbation axis (80/10/10) and the gene axis (60/20/20). No gene appears in more than one split.

Tracks

Track A -- Prompt-only

  • Model: GPT-OSS-120B (fixed, no fine-tuning)
  • Sampling: temperature=1.0, top_p=1.0
  • Format: Single prompt per question, max 4,096 prompt tokens
  • Seeds: 3 samples per question (seeds 42, 43, 44); final prediction = average of prediction_up / prediction_down across seeds
  • Submission: submission.csv + prompt.txt in a zip

Track B -- Agentic tool-use

  • Model: GPT-OSS-120B (fixed, no fine-tuning)
  • Sampling: temperature=1.0, top_p=1.0
  • Format: Prompt + tools + input question, max 4,096 prompt tokens
  • Limits: Max 100 distinct tools, max 250 tool calls per question
  • Submission: submission.csv + tools/ folder + prompt.txt in a zip

Track C -- Fine-tuning

  • Model: Open model < 10B parameters (e.g., Qwen3-4B-Thinking-2507), any fine-tuning allowed
  • Format: Prompt + input question, max 16,000 new tokens at inference
  • Allowed: SFT/LoRA, RL, process reward models, critic reranking, best-of-N
  • Not allowed: Tools, web access, or external models during inference
  • Submission: submission.csv + prompt.txt in a zip

Serving GPT-OSS-120B (Tracks A & B)

Tracks A and B use a fixed model that you serve locally via vLLM:

uv sync --extra serve

uv run --extra serve vllm serve openai/gpt-oss-120b \
    --port 8000 \
    --enforce-eager \
    --no-enable-prefix-caching

The model is ~120B parameters with mxfp4 quantization (~60 GB of weights). Use --tensor-parallel-size <N> to shard across multiple GPUs if a single GPU does not have enough memory. Two GPUs with ~80 GB each (e.g. A100-80G, H100, B200) are sufficient with --tensor-parallel-size 2.

Important server flags:

  • --enforce-eager — Disables CUDA graph capture. Without this flag, GPT-OSS hits a known vLLM bug where the first 1--2 requests succeed but subsequent requests return content: null with finish_reason: "length" despite tokens being generated server-side. The bug is triggered by CUDA graphs interacting with prefix caching and the attention-sink mechanism.

  • --no-enable-prefix-caching — Recommended by the vLLM GPT-OSS recipe for consistent behavior.

The first run downloads model weights from Hugging Face. Set HF_HOME to a partition with at least 120 GB of free disk space before starting the server. If the download is interrupted (e.g. disk full), the cached snapshot may be left in an inconsistent state -- delete the partial cache directory under $HF_HOME/hub/models--openai--gpt-oss-120b/ and retry.

Reasoning model behavior

GPT-OSS-120B is a reasoning model. Use max_completion_tokens (not the deprecated max_tokens) in your API requests to set the output budget for reasoning + visible answer combined. Set reasoning_effort to control how much the model reasons before answering:

reasoning_effort Behavior Typical tokens
"low" Brief reasoning, fast responses 30--100
"medium" Moderate reasoning 200--2,000
"high" Extended reasoning, highest quality 1,000--10,000+

Key parameter: max_completion_tokens vs max_tokens. For reasoning models, max_completion_tokens correctly budgets reasoning and visible output together. Using the legacy max_tokens parameter causes the model to consume the entire budget on reasoning without producing a visible answer.

The API response separates reasoning from the final answer:

{
  "choices": [{
    "message": {
      "reasoning": "... internal chain-of-thought ...",
      "content": "... final answer ..."
    },
    "finish_reason": "stop"
  }]
}

When the model runs out of tokens during reasoning, both reasoning and content will be null.

Example Scripts

Track A -- examples/track_a_prompt_only.py

Calls the LLM with 3 seeds (42, 43, 44), averages the predictions, and packages a zip. Use --concurrency N to send multiple requests in parallel for faster runs.

# Default: uses mlgenx built-in prompts
uv run python examples/track_a_prompt_only.py --api-base http://localhost:8000/v1

# Parallel requests (much faster)
uv run python examples/track_a_prompt_only.py --api-base http://localhost:8000/v1 --concurrency 20

# Use a custom prompt template (placeholders: {pert}, {gene}, {cell_desc})
uv run python examples/track_a_prompt_only.py --prompt-template examples/prompt_template.txt ...

# Use a CSV/JSONL of pre-written per-row prompts (columns: id, prompt)
uv run python examples/track_a_prompt_only.py --prompts-csv examples/example_prompts.csv ...

See examples/prompt_template.txt and examples/example_prompts.csv for input format examples.

Track B -- examples/track_b_agentic.py

Runs an agentic loop where the LLM can call tools between reasoning steps.

uv run python examples/track_b_agentic.py --api-base http://localhost:8000/v1

Three example tools are provided in examples/tools/:

Tool Source Description
train_data_lookup Local train.csv Look up known labels for a perturbation or gene
gene_info mygene.info API Retrieve gene annotations (summary, GO terms, pathways)
protein_interactions STRING DB API Query protein-protein interaction partners

Track B (adversarial-debate variant) -- examples/track_b_adversarial.py

A metric-aligned debate agent: a moderator gathers a shared evidence dossier, then runs two adversarial sub-debates (EFFECT vs NULL → P(DE); UP vs DOWN → P(up|DE)) whose judges emit calibrated probabilities combined as prediction_up = P(DE)·P(up|DE). See the architecture write-up (Markdown · rendered HTML) for the full diagram and design rationale. Validate configs offline first with examples/benchmark_track_b.py.

Track B (multi-agent variant) -- examples/track_b_multiagent.py

A multi-agent version of Track B where a coordinator agent delegates to specialist sub-agents, each backed by the same LLM via DSPy ReAct:

  • biology_expert — sub-agent with gene_info and protein_interactions tools
  • data_analyst — sub-agent with lookup_pert and lookup_gene tools

The coordinator consults one or both specialists, synthesizes their findings, and calls submit_answer. All traces are captured hierarchically: {"coordinator": {...}, "sub_agents": [...]}. Token and tool-call counts aggregate across all agents.

uv run python examples/track_b_multiagent.py --api-base http://localhost:8000/v1

# Tune iteration budgets
uv run python examples/track_b_multiagent.py \
    --api-base http://localhost:8000/v1 \
    --max-iters 20 --max-sub-iters 5

Track C -- examples/finetune.py + examples/track_c_finetune.py

Track C is a two-step workflow. Fine-tuning and serving require different dependency sets (train vs serve extras) because trl needs transformers>=5.3 while vLLM requires transformers<5. Switch between them by re-running uv sync with the appropriate extra.

Step 1: Fine-tune (run once, needs a GPU)

uv sync --extra train

uv run --extra train python examples/finetune.py \
    --train-csv data/train.csv \
    --model-id Qwen/Qwen3-4B-Thinking-2507 \
    --output-dir outputs/finetuned_model \
    --epochs 3 --lr 2e-4 --lora-rank 16

This produces a merged LoRA model in outputs/finetuned_model/.

Step 1b: Patch tokenizer (one-time fix after fine-tuning)

The train extra uses transformers>=5.3, which saves extra_special_tokens in a format incompatible with the transformers 4.x bundled by vLLM. Run this once after fine-tuning to fix the tokenizer config:

python -c "
import json; from pathlib import Path
p = Path('outputs/finetuned_model/tokenizer_config.json')
cfg = json.loads(p.read_text())
est = cfg.get('extra_special_tokens')
if isinstance(est, list):
    cfg['extra_special_tokens'] = {t: t for t in est} if est else {}
    p.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
    print(f'Fixed: converted list of {len(est)} tokens to dict')
else:
    print('No fix needed')
"

Step 2: Serve and run inference (needs a GPU)

uv sync --extra serve

# Serve with vLLM
uv run --extra serve vllm serve outputs/finetuned_model --port 8000

# In another terminal -- generate predictions
uv run --extra serve python examples/track_c_finetune.py \
    --api-base http://localhost:8000/v1 \
    --model outputs/finetuned_model \
    --base-model Qwen/Qwen3-4B-Thinking-2507

How to Submit

Step 1: Generate predictions

Use the example scripts above or write your own. Each script outputs a zip file ready for Kaggle upload.

Step 2: Verify your submission

Each track requires specific columns in submission.csv:

Track A columns: id, prediction_up, prediction_down, prediction_up_seed42, prediction_down_seed42, prediction_up_seed43, prediction_down_seed43, prediction_up_seed44, prediction_down_seed44, reasoning_trace_seed42, reasoning_trace_seed43, reasoning_trace_seed44, tokens_used, model_name

Track B columns: id, prediction_up, prediction_down, reasoning_trace, tokens_used, num_tool_calls, prompt_tokens, num_distinct_tools, model_name

Track C columns: id, prediction_up, prediction_down, reasoning_trace, tokens_used, model_name

The id column must match every row in test.csv exactly. Only id, prediction_up, and prediction_down are used for scoring; all other columns are required metadata. Submissions missing required metadata columns will receive a score of 0.

No null values allowed. Every cell must be filled. For rows where the model returned an empty response, use "none" for reasoning traces and 0 for token counts. The example scripts handle this automatically.

Step 3: Package into a zip

# Track A zip contents:
submission.csv
prompt.txt

# Track B zip contents:
submission.csv
prompt.txt
tools/*.py

# Track C zip contents:
submission.csv
prompt.txt

Step 4: Upload to Kaggle

Go to the competition page on Kaggle and upload your zip file.

Evaluation

The competition metric is the average of two micro AUROCs computed from the ternary labels:

  • DE AUROC: (up + down) vs none, using score prediction_up + prediction_down.
  • DIR AUROC: up vs down among DE-positive rows, using score prediction_up / (prediction_up + prediction_down) (conditional probability of up given DE).
score = (DE_AUROC + DIR_AUROC) / 2
  • Random baseline (reasonable spread across classes): near chance on both components
  • Perfect model: 1.0

Submissions that omit required metadata columns (reasoning traces, token counts, etc.) will score 0.0.

Quick Start

from mlgenx import format_prompt, parse_answer, build_submission

# Generate a prompt
prompt = format_prompt("Aars", "Actb")

# ... send to LLM, get response_text ...

# Parse the response
prediction_up, prediction_down = parse_answer(response_text)

# Build a submission
df = build_submission(ids, predictions_up, predictions_down, output_path="submission.csv")

Batch prompt generation

from mlgenx import format_prompts_from_csv

prompts_df = format_prompts_from_csv("data/test.csv")
# DataFrame with columns: id, prompt

Few-shot prompting

prompt = format_prompt("Aars", "Actb", examples=[
    {"pert": "Brca1", "gene": "Tp53", "label": "none"},
    {"pert": "Myc", "gene": "Cdkn1a", "label": "up"},
])

API Reference

Function Description
format_prompt(pert, gene, examples=None) Generate a single LLM prompt (zero-shot or few-shot)
format_prompts_from_csv(csv_path, examples=None) Generate prompts for all rows in a CSV
parse_answer(text, default=(0.333, 0.333)) Parse one LLM response into (prediction_up, prediction_down)
parse_answers(texts, default=(0.333, 0.333)) Parse a list of LLM responses
build_submission(ids, predictions_up, predictions_down, output_path=None) Assemble a submission DataFrame/CSV

References

  • Data format inspired by PerturbQA (Wu et al., ICLR 2025)
  • Source data: CRISPRi Perturb-seq in mouse BMDMs

About

Adversarial-debate LLM agent for predicting CRISPRi perturbation effects on gene expression (Genentech MLGenX BioReasoning Challenge)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors