Official implementation of FiRE, a fine-grained context learning framework for complex image retrieval.
Oral @ SIGIR'25
Bohan Hou¹, Haoqiang Lin¹, Xuemeng Song¹, Haokun Wen²,⁴, Meng Liu³, Yupeng Hu¹, Xiangyu Zhao⁴*
¹ Shandong University
² Harbin Institute of Technology (Shenzhen)
³ Shandong Jianzhu University
⁴ City University of Hong Kong
* Corresponding author
- Paper: ACM Digital Library
- Code Repository: GitHub
- [07/2025] Paper accepted at SIGIR 2025
- [04/2026] Initial open-source release
fire_opensource_clean/
├── README.md
├── README.zh-CN.md
├── requirements.txt
├── configs/
│   ├── train_stage2.example.yaml
│   └── eval.example.yaml
├── docs/
│   └── cleanup_notes.md
├── scripts/
│   ├── train.py
│   └── eval.py
└── src/fire_open/
    ├── __init__.py
    ├── config.py
    ├── datasets.py
    ├── losses.py
    ├── modeling.py
    └── trainer.py
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Edit the training config first:
configs/train_stage2.example.yaml
Then run:
python scripts/train.py --config configs/train_stage2.example.yaml

Edit the evaluation config first:
configs/eval.example.yaml
Then run:
python scripts/eval.py --config configs/eval.example.yaml

Example output:
{
  "Recall@1": 0.23,
  "Recall@5": 0.51,
  "Recall@10": 0.64,
  "Recall@50": 0.88
}

The example training config exposes the main public knobs:
seed: 42
model:
  base_model: Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5
  adapter_path: null
  freeze_base_model: true
lora:
  enabled: true
  r: 64
  alpha: 128
  dropout: 0.1
data:
  task: custom_jsonl
  image_root: ./data/images
  train_metadata: ./data/annotations/fire_train.jsonl
  max_short_edge: 380
  num_workers: 4
training:
  output_dir: ./outputs/fire_stage2
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 1
  num_train_epochs: 2
  learning_rate: 1.0e-4
  weight_decay: 0.01
  warmup_steps: 3000
  logging_steps: 50
  save_every_steps: 1000
  max_length: 512
  loss_scale: 100.0
  recall_loss_weight_at_1: 0.4
  recall_loss_weight_at_5: 0.15

The example evaluation config:
seed: 42
model:
  base_model: Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5
  adapter_path: ./outputs/fire_stage2
data:
  task: custom_jsonl  # custom_jsonl | fashioniq | cirr
  image_root: ./data/images
  query_metadata: ./data/annotations/eval_queries.jsonl
  gallery_metadata: ./data/annotations/eval_gallery.jsonl
  split: val
  dress_type: dress
  max_short_edge: 380
  num_workers: 4
eval:
  batch_size: 16
  max_length: 512
  ks: [1, 5, 10, 50]
  exclude_reference: true

Recommended format for training metadata: one JSON sample per line (the example below is expanded across lines for readability).
{
  "sample_id": "000001",
  "reference_image": "train/ref/0001.jpg",
  "target_image": "train/tgt/0001.jpg",
  "reference_id": "ref_0001",
  "target_id": "tgt_0001",
  "modification": "change the red shirt into a blue striped shirt",
  "reference_caption": "a person wearing a plain red shirt",
  "target_caption": "a person wearing a blue striped shirt"
}

Required fields:
- reference_image
- target_image
- modification

Optional fields:
- reference_caption
- target_caption
- reference_id
- target_id
- sample_id
If captions are unavailable, they can be left empty, though prompt quality may be weaker.
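The field requirements above can be checked before training with a small validator. The following is a minimal sketch and not part of the repository: the helper name `validate_training_record` and the choice to fill absent optional fields with an empty string are assumptions made for illustration.

```python
import json

REQUIRED_FIELDS = ("reference_image", "target_image", "modification")
OPTIONAL_FIELDS = (
    "reference_caption",
    "target_caption",
    "reference_id",
    "target_id",
    "sample_id",
)


def validate_training_record(line: str) -> dict:
    """Parse one JSONL line and check the required fields (hypothetical helper)."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Assumption: absent optional fields are treated as empty strings.
    for name in OPTIONAL_FIELDS:
        record.setdefault(name, "")
    return record


sample = {
    "reference_image": "train/ref/0001.jpg",
    "target_image": "train/tgt/0001.jpg",
    "modification": "change the red shirt into a blue striped shirt",
}
# JSONL stores each sample on a single line, so serialize without indentation.
line = json.dumps(sample)
record = validate_training_record(line)
print(record["target_caption"] == "")  # True
```

Running a check like this over every line of the metadata file surfaces malformed samples before a training run starts.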
Each line of query_metadata describes one query:
{
  "query_id": "q1",
  "reference_image": "eval/ref/001.jpg",
  "reference_id": "img_001",
  "modification": "make the bag black and remove the logo",
  "target_id": "img_128",
  "exclude_ids": ["img_001"]
}

Each line of gallery_metadata describes one gallery image:
{
  "image_id": "img_128",
  "image_path": "eval/gallery/128.jpg"
}

By design, the default evaluation path does not consume extra target-side text such as target_caption.
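To make the reported metric concrete, the sketch below computes Recall@K for query entries shaped like the one above. It is an illustration under assumptions, not the repository's `eval.py`: the function `recall_at_k`, the `rankings` mapping from `query_id` to a ranked list of gallery `image_id` values, and the handling of `exclude_ids` (dropping those IDs before taking the top K) are all hypothetical.

```python
from typing import Dict, List


def recall_at_k(
    queries: List[dict], rankings: Dict[str, List[str]], ks: List[int]
) -> Dict[str, float]:
    """Fraction of queries whose target_id appears in the top-K ranked gallery IDs."""
    hits = {k: 0 for k in ks}
    for query in queries:
        excluded = set(query.get("exclude_ids", []))
        # Assumption: excluded IDs (e.g. the reference image) are removed
        # from the ranking before the top-K cutoff is applied.
        ranked = [
            image_id
            for image_id in rankings[query["query_id"]]
            if image_id not in excluded
        ]
        for k in ks:
            if query["target_id"] in ranked[:k]:
                hits[k] += 1
    total = len(queries)  # assumes at least one query
    return {f"Recall@{k}": hits[k] / total for k in ks}


queries = [
    {"query_id": "q1", "target_id": "img_128", "exclude_ids": ["img_001"]},
    {"query_id": "q2", "target_id": "img_007", "exclude_ids": ["img_002"]},
]
# Hypothetical model output: gallery IDs sorted by descending similarity.
rankings = {
    "q1": ["img_001", "img_128", "img_050"],
    "q2": ["img_050", "img_128", "img_007"],
}
scores = recall_at_k(queries, rankings, ks=[1, 2, 3])
print(scores)
```

In this toy run, excluding `img_001` moves the target of `q1` to rank 1, while the target of `q2` sits at rank 3, so the result is `Recall@1 = 0.5`, `Recall@2 = 0.5`, and `Recall@3 = 1.0`.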
For FashionIQ, set the data section as follows:
data:
  task: fashioniq
  image_root: ./data/fashion_iq_data
  split: val
  dress_type: dress

For CIRR, set the data section as follows:
data:
  task: cirr
  image_root: ./data/CIRR
  split: val

@inproceedings{hou2025fire,
  author = {Bohan Hou and Haoqiang Lin and Xuemeng Song and Haokun Wen and Meng Liu and Yupeng Hu and Xiangyu Zhao},
  title = {FiRE: Enhancing MLLMs with Fine-Grained Context Learning for Complex Image Retrieval},
  booktitle = {Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages = {803--812},
  publisher = {ACM},
  year = {2025},
  doi = {10.1145/3726302.3729979}
}

This project is released under the Apache License 2.0.