Visual-AI/SPoT

SPoT: Surgical Post-Training

Code for the paper "Surgical Post-Training: Cutting Errors, Keeping Knowledge"

SPoT introduces new knowledge via an Oracle model to boost LLM reasoning, while preventing catastrophic forgetting through a reward-based binary optimization objective. The work also provides an in-depth investigation of the limitations of SFT and DPO.

Data Pipeline

The pipeline generates contrastive pairs (x, y⁻, y⁺) where y⁺ is a minimally-edited correction of the model's wrong response y⁻.
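For illustration, one such contrastive pair could be stored as a JSONL record per error; the problem, responses, and field names in this sketch are hypothetical, not taken from the actual dataset:

```python
import difflib

# Hypothetical example of one contrastive pair (x, y-, y+); the question,
# responses, and field names below are illustrative only.
x = "Expand (a+b)^2 and evaluate it for a=3, b=2."
y_minus = "Step 1: expand (a+b)^2 = a^2 + b^2. Answer: 13."     # model's error
y_plus = "Step 1: expand (a+b)^2 = a^2 + 2ab + b^2. Answer: 25."  # minimal fix

pair = {"prompt": x, "rejected": y_minus, "chosen": y_plus}

# A minimal edit keeps most of the wrong response's wording intact:
similarity = difflib.SequenceMatcher(None, y_minus, y_plus).ratio()
```

The point of the minimal edit is that y⁺ preserves the model's own reasoning style, so only the faulty step is changed.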

Step 1: Error Elicitation

Run inference on the DAPO-Math dataset to collect model failures:

```bash
bash scripts/run_dapo_inference.sh
```

Or with custom options:

```bash
python scripts/parallel_inference_dapo.py \
    --model Qwen/Qwen3-8B \
    --data_path data/dapo_5k/data.jsonl \
    --output_dir output/dapo_5k_inference \
    --num_gpus 8 \
    --temperature 0.7 \
    --top_p 0.8 \
    --max_tokens 32768
```

Outputs errors.jsonl (incorrect predictions) and all_results.jsonl.
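The split into the two files can be sketched as follows; the `is_correct` field is an assumption about the result schema, not necessarily the script's actual layout:

```python
import json

def split_results(all_results_path, errors_path):
    """Copy the incorrect predictions out of all_results.jsonl into errors.jsonl.

    Sketch only: assumes each JSONL line carries an `is_correct` flag written
    by the grader, which may differ from the script's real field names.
    Returns the number of error records written.
    """
    n_errors = 0
    with open(all_results_path) as src, open(errors_path, "w") as dst:
        for line in src:
            if not json.loads(line).get("is_correct", False):
                dst.write(line)
                n_errors += 1
    return n_errors
```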

Step 2: Oracle Rectification

Use Gemini 2.5 Pro to surgically correct the errors (supports resuming):

```bash
# Correction mode: correct student errors while preserving style
python scripts/correct_errors_parallel.py \
    --input output/dapo_5k_inference/errors.jsonl \
    --output data/gemini_corrected.jsonl \
    --workers 200
```
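One way the resuming could work is to skip records whose ids already appear in the output file; a minimal sketch, where `call_oracle` and the `id`/`corrected` fields are placeholders rather than the script's real API:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def correct_errors(input_path, output_path, call_oracle, workers=8):
    """Run the oracle over each error record in parallel, appending results.

    Sketch of resumable operation: records whose `id` already appears in the
    output file are skipped, so an interrupted run can be restarted safely.
    `call_oracle` and the `id`/`corrected` field names are placeholders.
    """
    done = set()
    if Path(output_path).exists():
        with open(output_path) as f:
            done = {json.loads(line)["id"] for line in f}

    with open(input_path) as f:
        todo = [r for r in map(json.loads, f) if r["id"] not in done]

    with ThreadPoolExecutor(max_workers=workers) as pool, \
            open(output_path, "a") as out:
        for record, corrected in zip(todo, pool.map(call_oracle, todo)):
            record["corrected"] = corrected
            out.write(json.dumps(record) + "\n")
```

Appending (rather than rewriting) the output file is what makes restarts cheap: finished work is never redone.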

Evaluation

Supported Benchmarks

| Task string | Benchmark | Script |
| --- | --- | --- |
| `custom\|aime24\|0\|0` | AIME 2024 | `sober_eval/main.py` |
| `custom\|aime25\|0\|0` | AIME 2025 | `sober_eval/main.py` |
| `custom\|amc23\|0\|0` | AMC 2023 | `sober_eval/main.py` |
| `custom\|math_500\|0\|0` | MATH-500 | `sober_eval/main.py` |
| `custom\|minerva\|0\|0` | Minerva Math | `sober_eval/main.py` |
| `custom\|olympiadbench\|0\|0` | OlympiadBench | `sober_eval/main.py` |
| `custom\|gpqa:diamond\|0\|0` | GPQA-Diamond | `sober_eval/main.py` |
| `custom\|ifeval_no_thinking\|0\|0` | IFEval | `sober_eval/main.py` |
| `connect4` | Connect4 (OOD, dynamically generated) | `eval/eval_connect4.py` |

Connect4 serves as an OOD reasoning benchmark with verifiable intermediate steps. Game states are dynamically generated via GAMEBoT to prevent data contamination, requiring a separate evaluation script.
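As a stand-in illustration of dynamic state generation (the repo uses GAMEBoT, not this code), a legal Connect4 position can be sampled by playing random moves from an empty board; for simplicity this sketch does not stop early on a win:

```python
import random

ROWS, COLS = 6, 7

def random_position(n_moves, seed):
    """Generate a Connect4 position by playing n_moves random legal moves.

    Illustrative stand-in for dynamic generation: a fresh seed yields a fresh
    position, so evaluation states cannot appear in any training corpus.
    """
    rng = random.Random(seed)
    board = [["." for _ in range(COLS)] for _ in range(ROWS)]
    heights = [0] * COLS
    player = "X"
    for _ in range(n_moves):
        col = rng.choice([c for c in range(COLS) if heights[c] < ROWS])
        board[ROWS - 1 - heights[col]][col] = player  # pieces stack bottom-up
        heights[col] += 1
        player = "O" if player == "X" else "X"
    return board
```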

Multi-Trial Evaluation

The script loads the model once and duplicates the dataset N times, which is more efficient than re-running the evaluation once per seed:

```bash
bash run_evaluation_multi_trial_duplicated_data.sh /path/to/model --disable-thinking
```

Results are saved to evaluation_results/aggregated_results.json.
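The duplicate-then-aggregate idea can be sketched as follows; the field names are assumptions, not the actual schema of `aggregated_results.json`:

```python
import statistics

def duplicate_dataset(dataset, n_trials):
    """Tag every example with a trial index so one inference pass covers N trials."""
    return [dict(ex, trial=t) for t in range(n_trials) for ex in dataset]

def aggregate(results):
    """Mean and stdev of per-trial accuracy.

    Mirrors the idea behind aggregated_results.json; the `trial`/`correct`
    field names here are assumptions for illustration.
    """
    by_trial = {}
    for r in results:
        by_trial.setdefault(r["trial"], []).append(1 if r["correct"] else 0)
    accs = [sum(v) / len(v) for v in by_trial.values()]
    return {"mean_accuracy": statistics.mean(accs),
            "stdev": statistics.stdev(accs) if len(accs) > 1 else 0.0}
```

Because the model stays resident across all N copies, the per-trial cost is just inference over the duplicated examples, with no repeated model loading.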
