Official implementation of RANKVIDEO, a video-native reasoning reranker for text-to-video retrieval. RANKVIDEO explicitly reasons over query-video pairs using video content to assess relevance.
RANKVIDEO is trained using a two-stage curriculum:
- Perception-grounded SFT: The model learns to generate captions grounded in video content
- Reranking Training: Fine-tuning with pointwise, pairwise, and teacher distillation objectives
Given a query-video pair, RANKVIDEO predicts relevance by comparing log-probabilities of discrete answer tokens, producing a scalar relevance score.
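The scoring rule can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: it assumes the model exposes log-probabilities for two discrete answer tokens ("yes"/"no"); the actual prompt template and token set live in the codebase.

```python
import math

def relevance_score(logp_yes: float, logp_no: float) -> dict:
    """Turn log-probabilities of two discrete answer tokens into a
    scalar relevance score via a two-way softmax (illustrative sketch)."""
    logit_delta = logp_yes - logp_no              # > 0 means "yes" is more likely
    p_yes = 1.0 / (1.0 + math.exp(-logit_delta))  # softmax over {yes, no}
    return {"p_yes": p_yes, "logit_delta": logit_delta}

# Example: "yes" somewhat more probable than "no"
score = relevance_score(logp_yes=-0.4, logp_no=-1.1)
```

Because only the two answer-token log-probabilities matter, the score is a deterministic function of their difference.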
```bash
git clone https://github.com/tskow99/RANKVIDEO-Reasoning-Reranker.git
cd RANKVIDEO-Reasoning-Reranker
conda create -n RankVideo python=3.12 pip -y
conda activate RankVideo
```

Below is the minimal set of dependencies needed to run the inference code.
```bash
# Upgrade pip and core tools
python -m pip install -U pip setuptools wheel

# Install ffmpeg via conda
conda install -c conda-forge -y ffmpeg

# Install PyTorch with CUDA 12.8 support
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128

# Install vLLM for efficient inference
pip install vllm==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu128

# Install remaining dependencies
pip install transformers==4.57.1
pip install "qwen-vl-utils[decord]==0.0.14"
pip install jsonlines==4.0.0 tqdm==4.67.1
pip install pytrec_eval-terrier
pip install trl
pip install peft
pip install bitsandbytes
```

Before running, update the paths in `config.py`:
```python
# Model paths
BASE_MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
HF_MODEL_PATH = "hltcoe/RankVideo"
CACHE_DIR = "/path/to/model/cache"

# Data paths
TRAIN_DATA_PATH = "/path/to/train_data.jsonl"
EVAL_DATA_PATH = "/path/to/eval_data.jsonl"
VIDEO_DIR = "/path/to/videos"
```

Download pretrained RANKVIDEO weights from HuggingFace:
```python
from rankvideo import VLMReranker

reranker = VLMReranker(
    model_path="hltcoe/RankVideo",
    cache_dir="/path/to/cache",
)
```

Score query-video pairs for relevance:
```python
from rankvideo import VLMReranker

reranker = VLMReranker(model_path="hltcoe/RankVideo")
scores = reranker.score_batch(
    queries=["example query 1", "example query 2"],
    video_paths=["/path/to/video1.mp4", "/path/to/video2.mp4"],
)
for score in scores:
    print(f"P(relevant) = {score['p_yes']:.3f}")
    print(f"Logit delta = {score['logit_delta']:.3f}")
```

Rerank a batch of candidates from the command line:

```bash
python -m rankvideo.rerank \
    --model hltcoe/RankVideo \
    --video2queries data/video2queries.json \
    --query-mapping data/queries.tsv \
    --video-dir /path/to/videos \
    --output-dir outputs/reranking
```

Evaluate the reranked output against the first stage:

```bash
python -m rankvideo.evaluate \
    --pred-dir outputs/reranking \
    --first-stage data/first_stage_results.json \
    --qrels data/qrels.txt
```

RANKVIDEO includes scripts for generating training data:
Generate perception-grounded captions for videos using a VLM:
```bash
python -m rankvideo.generate_captions \
    --video-root /path/to/videos \
    --output data/captions.jsonl \
    --model-name Qwen/Qwen3-Omni-30B-A3B-Instruct
```

Generate reasoning traces and soft labels from a teacher model for distillation:

```bash
python -m rankvideo.generate_reasoning \
    --queries data/queries.tsv \
    --candidates data/candidates.trec \
    --captions-jsonl data/captions.jsonl \
    --output data/teacher_labels.jsonl \
    --model-path /path/to/reasoning/model \
    --topk 20
```

Stage 1 (perception-grounded SFT):

```bash
python -m rankvideo.train_perception \
    --model Qwen/Qwen3-VL-8B-Instruct \
    --train-data /path/to/caption_data.jsonl \
    --eval-data /path/to/caption_eval.jsonl \
    --output-dir outputs/stage1
```

Stage 2 (reranking training):

```bash
python -m rankvideo.train_reranker \
    --model outputs/stage1/perception-TIMESTAMP \
    --train-data /path/to/ranking_data.jsonl \
    --eval-data /path/to/ranking_eval.jsonl \
    --output-dir outputs/stage2
```

Each line should be a JSON object with:
```json
{
  "query_id": "q001",
  "query": "person playing guitar on stage",
  "doc_id": "video_123",
  "videos": ["/path/to/video_123.mp4"],
  "true_label": 1,
  "teacher_p_yes": 0.85,
  "evidence": {
    "caption": "A musician performs with an acoustic guitar...",
    "asr": "transcribed speech if available"
  }
}
```

The query mapping (`--query-mapping`) is a TSV file:

```
query_id	query_text
q001	person playing guitar on stage
q002	dog running in park
```
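Such a TSV can be loaded with the standard library; a minimal sketch (the column layout is taken from the example above, and the inline string stands in for the file contents):

```python
import csv
import io

# Inline sample mirroring the queries.tsv example above
tsv_text = (
    "q001\tperson playing guitar on stage\n"
    "q002\tdog running in park\n"
)

# Parse query_id -> query_text
queries = {}
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
for query_id, query_text in reader:
    queries[query_id] = query_text
```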
First-stage results (`--first-stage`) map query IDs to candidate videos with scores:

```
{
  "q001": {"video_123": 0.95, "video_456": 0.82, ...},
  "q002": {"video_789": 0.91, ...}
}
```

Qrels follow the standard TREC format (`query_id iteration doc_id relevance`):

```
q001 0 video_123 1
q001 0 video_456 0
```
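Given qrels and a run of predicted scores in these formats, ranking metrics are computed with `pytrec_eval` (installed above). The computation itself is small; here is a pure-Python sketch of nDCG@k for a single query, using the toy data from the examples above:

```python
import math

# Toy qrels and run in the dict shapes used above
qrels = {"q001": {"video_123": 1, "video_456": 0, "video_789": 1}}
run = {"q001": {"video_123": 0.95, "video_456": 0.82, "video_789": 0.40}}

def ndcg_at_k(qrels_q: dict, run_q: dict, k: int = 10) -> float:
    """nDCG@k for one query: DCG of the predicted ranking over the
    ideal DCG of relevance labels sorted descending."""
    ranked = sorted(run_q, key=run_q.get, reverse=True)[:k]
    dcg = sum(qrels_q.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
    ideal = sorted(qrels_q.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

score = ndcg_at_k(qrels["q001"], run["q001"])
```

In the toy example the run places a non-relevant video at rank 2, so nDCG is below 1.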
We provide preprocessed data files to reproduce our results. Extract the archive:
```bash
tar -xzvf data.tar.gz
```

This creates the following structure:
```
data/
├── training_data.json                        # Training data with teacher labels
├── videos2queriesranking_AV_OmniEmbed.json   # Video-to-query candidate mapping
└── first_stage_results/
    └── ranking_AV_OmniEmbed.json             # First-stage retrieval scores
```
Training examples with teacher reasoning traces for distillation. Each line contains:
- `query_id`, `doc_id`: Identifiers for the query-video pair
- `query`: The text query
- `evidence`: Contains `caption` (video description) and `asr` (speech transcript)
- `teacher_reasoning`: Reasoning trace from the teacher model

Used for: Stage 2 reranker training (`--train-data`)
Maps each video ID to the list of query IDs it is a candidate for:
```
{"video_id": ["query_1", "query_2", ...], ...}
```

Used for: Batch reranking (`--video2queries` in `rerank.py`)
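This mapping is the inverse of a query-to-videos ranking; a sketch (not the repo's own preprocessing) of building it from first-stage scores:

```python
from collections import defaultdict

# First-stage scores: query_id -> {video_id: score} (toy values)
first_stage = {
    "q001": {"video_123": 0.95, "video_456": 0.82},
    "q002": {"video_123": 0.70, "video_789": 0.91},
}

# Invert to video_id -> [query_id, ...]
video2queries = defaultdict(list)
for query_id, candidates in first_stage.items():
    for video_id in candidates:
        video2queries[video_id].append(query_id)
```

Grouping by video lets the reranker load each video once and score it against all of its candidate queries.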
First-stage retrieval scores from OmniEmbed. Maps query IDs to candidate videos with scores:
```
{"query_id": {"video_id": score, ...}, ...}
```

Used for: Evaluation baseline and cascade reranking (`--first-stage` in `evaluate.py`)
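The exact cascade used by `evaluate.py` is not spelled out here; one common scheme, shown as a hedged sketch, reranks only the top-k first-stage candidates and interpolates the two scores (the `alpha` and `k` values are illustrative, not from the paper):

```python
def cascade_scores(first_stage_q: dict, reranker_q: dict,
                   alpha: float = 0.5, k: int = 20) -> dict:
    """Fuse reranker probabilities with first-stage scores for the top-k
    candidates of one query; others keep their first-stage score."""
    topk = sorted(first_stage_q, key=first_stage_q.get, reverse=True)[:k]
    fused = dict(first_stage_q)
    for video_id in topk:
        fused[video_id] = (alpha * reranker_q.get(video_id, 0.0)
                           + (1 - alpha) * first_stage_q[video_id])
    return fused

# Toy example: rerank only the top 2 candidates
first_stage_q = {"video_123": 0.95, "video_456": 0.82, "video_789": 0.10}
reranker_q = {"video_123": 0.99, "video_456": 0.20}  # p_yes from the reranker
fused = cascade_scores(first_stage_q, reranker_q, alpha=0.5, k=2)
```

Restricting the reranker to the top-k candidates keeps the cost of running a VLM over videos manageable while still correcting first-stage errors near the top of the ranking.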
- Download MultiVENT 2.0 videos from [multivent.github.io](https://multivent.github.io)

- Run reranking on the first-stage candidates:

  ```bash
  python -m rankvideo.rerank \
      --model hltcoe/RankVideo \
      --video2queries data/videos2queriesranking_AV_OmniEmbed.json \
      --video-dir /path/to/multivent/videos \
      --output-dir outputs/reranking
  ```

- Evaluate against the first-stage baseline:

  ```bash
  python -m rankvideo.evaluate \
      --pred-dir outputs/reranking \
      --first-stage data/first_stage_results/ranking_AV_OmniEmbed.json \
      --qrels /path/to/multivent/qrels.txt
  ```

For training from scratch, we use the MultiVENT 2.0 dataset, a large-scale multilingual video retrieval benchmark.
```bibtex
@misc{skow2026rankvideoreasoningrerankingtexttovideo,
  title={RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval},
  author={Tyler Skow and Alexander Martin and Benjamin Van Durme and Rama Chellappa and Reno Kriz},
  year={2026},
  eprint={2602.02444},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2602.02444},
}
```

This project is released under the MIT License.