A video understanding benchmark for evaluating vision-language models on causal reasoning in pool table physics. The dataset contains multiple-choice questions about cue ball movements, requiring models to understand physical causality from video observations.
Causal Pool evaluates models on four question types:
- Descriptive: What happened in the video?
- Predictive: What will happen next?
- Counterfactual (velocity): What if the initial velocity changed?
- Counterfactual (position): What if the initial position changed?
Each question includes a video of a pool table physics simulation and multiple-choice options about ball trajectories, wall collisions, and pocket outcomes.
Install dependencies with uv:

```bash
uv sync
```

For SLURM cluster usage, see `documents/dev.md` for detailed setup instructions.
Datasets are organized in datasets/ with the following structure:
```
datasets/
└── <dataset_name>/
    ├── splits/
    │   ├── train-{question_type}.jsonl
    │   └── test-{question_type}.jsonl
    ├── shots/
    │   └── <shot_id>/
    │       └── video files (*-full.mp4, *-partial.mp4)
    └── raw_qa.jsonl
```
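
This layout maps directly to file paths; a small sketch of navigating it with `pathlib` (the dataset name, question type, and shot id below are placeholders, and the utilities in `causal_pool/data/` are the intended way to load the data):

```python
from pathlib import Path

# Placeholders: substitute a real dataset name and shot id.
dataset_dir = Path("datasets") / "my_dataset"        # <dataset_name>

# One test split file per question type
for split_file in sorted((dataset_dir / "splits").glob("test-*.jsonl")):
    print(split_file.name)

# Full and partial renderings of a single shot
shot_dir = dataset_dir / "shots" / "shot_0000"       # <shot_id>
videos = sorted(shot_dir.glob("*-full.mp4")) + sorted(shot_dir.glob("*-partial.mp4"))
print(videos)
```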
Each dataset entry contains:

- `video`: Shot identifier
- `question`: Question text with context
- `options`: List of multiple-choice options
- `ground_truth`: Correct answer(s) as indices
- `metadata.question_type`: One of the four question types
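
For example, reading a split file and inspecting these fields might look like the following sketch (the path is a placeholder following the layout above):

```python
import json
from pathlib import Path

# Placeholder path; substitute a real dataset name and question type.
split_path = Path("datasets/<dataset_name>/splits/test-<question_type>.jsonl")

with split_path.open() as f:
    entries = [json.loads(line) for line in f]

entry = entries[0]
print(entry["video"])                      # shot identifier
print(entry["question"])                   # question text with context
print(entry["options"])                    # list of multiple-choice options
print(entry["ground_truth"])               # correct answer(s) as indices
print(entry["metadata"]["question_type"])  # one of the four question types
```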
Train models on the dataset using LoRA:
```bash
# Configure dataset and model in causal_pool/sft/train.py
python causal_pool/sft/train.py
```

Or use the SLURM script:

```bash
sbatch scripts/sft.sh
```

Training supports:
- Counterfactual training (position/velocity)
- Descriptive training
- LoRA with configurable rank and target modules
- Custom trainer with per-question accuracy metrics
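
As a rough sketch of the LoRA piece of this setup using PEFT (the rank, target modules, and model-loading class below are illustrative assumptions; the actual configuration lives in `causal_pool/sft/train.py`):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Illustrative only: the exact model class and LoRA hyperparameters used by
# causal_pool/sft/train.py may differ.
base_model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")

lora_config = LoraConfig(
    r=16,                                                   # LoRA rank (configurable)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```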
Evaluate a model on a dataset:
```bash
# Start vLLM server (see documents/dev.md for full setup)
# Then run evaluation:
python causal_pool/eval/eval.py \
    --dataset <dataset_name> \
    --base-url "http://localhost:8000/v1" \
    --model "Qwen/Qwen3-VL-4B-Instruct" \
    --num-samples 1 \
    --max-concurrent 256 \
    --max-tokens 10
```
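
For a quick sanity check against the running server, you can also query the OpenAI-compatible endpoint directly. A minimal, text-only sketch using the `openai` client (the question below is made up; `causal_pool/eval/eval.py` additionally attaches the shot video and uses the benchmark's prompt format):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-4B-Instruct",
    messages=[{
        "role": "user",
        "content": (
            "Answer with the letter of the correct option only.\n"
            "Q: Which pocket does the cue ball end up in?\n"  # hypothetical question
            "A) Top-left  B) Top-right  C) Bottom-left  D) None"
        ),
    }],
    max_tokens=10,
    temperature=0.0,
)
print(response.choices[0].message.content)
```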

Use the automated evaluation script, which handles the vLLM server lifecycle:

```bash
# Via SLURM
sbatch scripts/run_eval.sh --model "Qwen/Qwen3-VL-4B-Instruct" --dataset <dataset_name>

# Directly on a compute node
bash scripts/run_eval_direct.sh --model "Qwen/Qwen3-VL-4B-Instruct" --dataset <dataset_name>
```

Evaluate multiple models configured in `configs/eval/config.yaml`:
```bash
python scripts/batch_eval.py
```

For debugging individual questions:
```bash
python scripts/manual_eval.py -d <dataset_name> -m "Qwen/Qwen3-VL-4B-Instruct" -u http://localhost:8000/v1 --fps 15 -i 0
```

The evaluation system uses Hydra-based configs in `configs/eval/`:

- `config.yaml`: Main config specifying which models to evaluate
- `models/`: Model-specific configurations
- `vllm/`: vLLM serving configurations
- `eval/`: Evaluation hyperparameters
See `configs/eval/README.md` for a detailed configuration guide.
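
To inspect a composed config outside the entry points, you can use Hydra's compose API; a sketch assuming the layout above (the evaluation scripts themselves may wire configs differently):

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# config_path is interpreted relative to this file; adjust if running elsewhere.
with initialize(version_base=None, config_path="configs/eval"):
    cfg = compose(config_name="config")

print(OmegaConf.to_yaml(cfg))
```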
Two accuracy metrics are computed:
- Per-question accuracy: Exact match of predicted answer(s) with ground truth
- Per-option accuracy: Option-level agreement between the predicted and ground-truth option sets, derived from their Hamming distance
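
Conceptually, the two metrics look like the sketch below (illustrative only; the reference implementation is `causal_pool/metrics.py` and may differ in details such as normalization):

```python
def per_question_accuracy(predicted: set[int], ground_truth: set[int]) -> float:
    """Exact match: 1.0 only if the predicted option set equals the ground truth."""
    return float(predicted == ground_truth)


def per_option_accuracy(predicted: set[int], ground_truth: set[int], num_options: int) -> float:
    """Fraction of options whose selected/unselected status agrees with the
    ground truth, i.e. 1 minus the normalized Hamming distance."""
    mismatches = sum((i in predicted) != (i in ground_truth) for i in range(num_options))
    return 1.0 - mismatches / num_options


# Example: 4 options, the model predicts {0, 2}, ground truth is {0, 1}
print(per_question_accuracy({0, 2}, {0, 1}))   # 0.0
print(per_option_accuracy({0, 2}, {0, 1}, 4))  # 0.5
```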
Results are saved as JSON files in results/ with detailed per-question breakdowns.
Repository layout:

```
causal_pool/
├── causal_pool/              # Main package
│   ├── data/                 # Dataset loading utilities
│   ├── eval/                 # Evaluation scripts and utilities
│   ├── sft/                  # Supervised fine-tuning code
│   ├── metrics.py            # Accuracy metrics
│   └── prompt_utils.py       # Prompt construction
├── configs/                  # Hydra configuration files
├── datasets/                 # Dataset files and splits
├── scripts/                  # Utility scripts
│   ├── auto_eval.py          # Automated evaluation orchestrator
│   ├── batch_eval.py         # Batch evaluation submission
│   ├── manual_eval.py        # Interactive evaluation
│   └── process_dataset.py    # Dataset processing
├── outputs/                  # Training outputs and logs
└── results/                  # Evaluation results
```
Key scripts:

- `scripts/auto_eval.py`: Orchestrates vLLM server launch and evaluation
- `scripts/batch_eval.py`: Submits multiple evaluation jobs
- `scripts/manual_eval.py`: Interactive evaluation for debugging
- `scripts/process_dataset.py`: Processes raw QA data into train/test splits
- `scripts/merge_lora_ckpt.py`: Merges LoRA checkpoints with the base model
- `scripts/plot_category_metrics.py`: Visualizes evaluation results
Supported models:

- Qwen3-VL (4B, 8B, 32B, 30B-A3B Instruct/Thinking variants)
- Custom fine-tuned CausalPool models
Requirements:

- Python >= 3.12
- CUDA-capable GPU
- vLLM for model serving (via Apptainer on SLURM clusters)
- See `pyproject.toml` for the full dependency list
Additional documentation:

- `documents/dev.md`: Detailed development and cluster usage guide
- `configs/eval/README.md`: Evaluation configuration guide
- `tests/README.md`: Testing documentation