FiRE: Enhancing MLLMs with Fine-Grained Context Learning for Complex Image Retrieval

Official implementation of FiRE, a fine-grained context learning framework for complex image retrieval.

Oral @ SIGIR'25

Authors

Bohan Hou¹, Haoqiang Lin¹, Xuemeng Song¹, Haokun Wen²,⁴, Meng Liu³, Yupeng Hu¹, Xiangyu Zhao⁴*

¹ Shandong University
² Harbin Institute of Technology (Shenzhen)
³ Shandong Jianzhu University
⁴ City University of Hong Kong
* Corresponding author

Links

Paper: ACM Digital Library
Code Repository: GitHub

Updates

[07/2025] Paper accepted at SIGIR 2025
[04/2026] Initial open-source release

fire_opensource_clean/
├── README.md
├── README.zh-CN.md
├── requirements.txt
├── configs/
│   ├── train_stage2.example.yaml
│   └── eval.example.yaml
├── docs/
│   └── cleanup_notes.md
├── scripts/
│   ├── train.py
│   └── eval.py
└── src/fire_open/
    ├── __init__.py
    ├── config.py
    ├── datasets.py
    ├── losses.py
    ├── modeling.py
    └── trainer.py

Installation

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Quick Start

Train

Edit the training config first:

configs/train_stage2.example.yaml

Then run:

python scripts/train.py --config configs/train_stage2.example.yaml

Evaluate

Edit the evaluation config first:

configs/eval.example.yaml

Then run:

python scripts/eval.py --config configs/eval.example.yaml

Example output:

{
  "Recall@1": 0.23,
  "Recall@5": 0.51,
  "Recall@10": 0.64,
  "Recall@50": 0.88
}
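
For readers unfamiliar with the metric, the sketch below shows how Recall@K values like those in the example output are conventionally computed: the fraction of queries whose target appears among the top K ranked gallery items. The function name `recall_at_k` and its arguments are illustrative assumptions, not the interface of `scripts/eval.py`.

```python
from typing import Dict, Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    target_ids: Sequence[str],
    ks: Sequence[int] = (1, 5, 10, 50),
) -> Dict[str, float]:
    """Fraction of queries whose target id appears in the top-k ranked ids."""
    if len(ranked_ids) != len(target_ids):
        raise ValueError("ranked_ids and target_ids must have the same length")
    if not target_ids:
        raise ValueError("at least one query is required")
    results = {}
    for k in ks:
        hits = sum(
            1
            for ranking, target in zip(ranked_ids, target_ids)
            if target in ranking[:k]
        )
        results[f"Recall@{k}"] = hits / len(target_ids)
    return results


# Two toy queries: the first target is ranked 1st, the second is ranked 3rd.
example = recall_at_k(
    ranked_ids=[["a", "b", "c"], ["x", "y", "z"]],
    target_ids=["a", "z"],
    ks=(1, 5),
)
print(example)
```

Under this definition Recall@K cannot decrease as K grows, which matches the ordering of the example values above.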

Configuration Overview

Training config

The example training config exposes the main public knobs:

seed: 42
model:
  base_model: Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5
  adapter_path: null
  freeze_base_model: true
  lora:
    enabled: true
    r: 64
    alpha: 128
    dropout: 0.1

data:
  task: custom_jsonl
  image_root: ./data/images
  train_metadata: ./data/annotations/fire_train.jsonl
  max_short_edge: 380
  num_workers: 4

training:
  output_dir: ./outputs/fire_stage2
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 1
  num_train_epochs: 2
  learning_rate: 1.0e-4
  weight_decay: 0.01
  warmup_steps: 3000
  logging_steps: 50
  save_every_steps: 1000
  max_length: 512
  loss_scale: 100.0
  recall_loss_weight_at_1: 0.4
  recall_loss_weight_at_5: 0.15

Evaluation config

seed: 42
model:
  base_model: Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5
  adapter_path: ./outputs/fire_stage2

data:
  task: custom_jsonl   # custom_jsonl | fashioniq | cirr
  image_root: ./data/images
  query_metadata: ./data/annotations/eval_queries.jsonl
  gallery_metadata: ./data/annotations/eval_gallery.jsonl
  split: val
  dress_type: dress
  max_short_edge: 380
  num_workers: 4

eval:
  batch_size: 16
  max_length: 512
  ks: [1, 5, 10, 50]
  exclude_reference: true
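
The `exclude_reference` flag presumably follows the common composed image retrieval convention of dropping the query's own reference image from the candidate list before ranking. The sketch below illustrates that convention only; `rank_gallery` and the score dictionary are hypothetical and not part of `fire_open`.

```python
from typing import Dict, Iterable, List


def rank_gallery(scores: Dict[str, float], exclude_ids: Iterable[str] = ()) -> List[str]:
    """Sort gallery ids by descending similarity, skipping excluded ids."""
    excluded = set(exclude_ids)
    candidates = [
        (image_id, score)
        for image_id, score in scores.items()
        if image_id not in excluded
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [image_id for image_id, _ in candidates]


# Toy similarity scores; the reference image usually scores highest against its own query.
scores = {"img_001": 0.97, "img_128": 0.91, "img_042": 0.35}
with_reference = rank_gallery(scores)
without_reference = rank_gallery(scores, exclude_ids=["img_001"])
print(with_reference)
print(without_reference)
```

Without the exclusion, the reference image tends to occupy rank 1 and pushes the true target down, which deflates Recall@1.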

Data Format

Training data (`jsonl`)

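Recommended format: one JSON object per line. The example below is pretty-printed for readability; in the actual file each sample must occupy a single line.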

{
  "sample_id": "000001",
  "reference_image": "train/ref/0001.jpg",
  "target_image": "train/tgt/0001.jpg",
  "reference_id": "ref_0001",
  "target_id": "tgt_0001",
  "modification": "change the red shirt into a blue striped shirt",
  "reference_caption": "a person wearing a plain red shirt",
  "target_caption": "a person wearing a blue striped shirt"
}

Required fields:

reference_image
target_image
modification

Optional fields:

reference_caption
target_caption
reference_id
target_id
sample_id

If captions are unavailable, leave them empty; prompts built without captions may be weaker.
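
A quick pre-flight check of the annotation file can catch missing fields before training starts. The following is a minimal sketch using only the standard library; `parse_training_line` is a hypothetical helper, and filling absent optional fields with empty strings is an assumption about how `datasets.py` treats them.

```python
import json

REQUIRED_FIELDS = ("reference_image", "target_image", "modification")
OPTIONAL_FIELDS = (
    "reference_caption",
    "target_caption",
    "reference_id",
    "target_id",
    "sample_id",
)


def parse_training_line(line: str) -> dict:
    """Parse one JSONL line and verify the required fields are non-empty."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Assumption: absent optional fields default to empty strings.
    for field in OPTIONAL_FIELDS:
        record.setdefault(field, "")
    return record


line = json.dumps(
    {
        "reference_image": "train/ref/0001.jpg",
        "target_image": "train/tgt/0001.jpg",
        "modification": "change the red shirt into a blue striped shirt",
    }
)
sample = parse_training_line(line)
print(sample["modification"], repr(sample["target_caption"]))
```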

Evaluation query metadata (`jsonl`)

{
  "query_id": "q1",
  "reference_image": "eval/ref/001.jpg",
  "reference_id": "img_001",
  "modification": "make the bag black and remove the logo",
  "target_id": "img_128",
  "exclude_ids": ["img_001"]
}

Evaluation gallery metadata (`jsonl`)

{
  "image_id": "img_128",
  "image_path": "eval/gallery/128.jpg"
}

By design, the default evaluation path does not consume extra target-side text such as target_caption.
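
Because queries refer to gallery entries by id, it can help to confirm that every `target_id` resolves to an `image_id` before running evaluation. The sketch below is one way to do that with the standard library; both helper names are hypothetical, and the second query is invented purely to show a failing case.

```python
import json
from typing import List


def load_jsonl_text(text: str) -> List[dict]:
    """Parse JSONL content, ignoring blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_missing_targets(queries: List[dict], gallery: List[dict]) -> List[str]:
    """Return the query ids whose target_id is absent from the gallery."""
    gallery_ids = {item["image_id"] for item in gallery}
    return [q["query_id"] for q in queries if q["target_id"] not in gallery_ids]


queries = load_jsonl_text(
    '{"query_id": "q1", "reference_id": "img_001", "target_id": "img_128"}\n'
    '{"query_id": "q2", "reference_id": "img_002", "target_id": "img_999"}\n'
)
gallery = load_jsonl_text(
    '{"image_id": "img_001", "image_path": "eval/gallery/001.jpg"}\n'
    '{"image_id": "img_128", "image_path": "eval/gallery/128.jpg"}\n'
)
missing = find_missing_targets(queries, gallery)
print(missing)
```

A query with an unresolvable target can never be counted as a hit, so leaving such queries in place silently lowers every Recall@K value.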

Supported Evaluation Tasks

FashionIQ example

data:
  task: fashioniq
  image_root: ./data/fashion_iq_data
  split: val
  dress_type: dress

CIRR example

data:
  task: cirr
  image_root: ./data/CIRR
  split: val
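
The three task modes expect different keys in the `data` section. The sketch below encodes the pattern visible in the examples in this README as a small validator; the exact requirements enforced by `config.py` are an assumption, and `check_data_section` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Task names taken from the comment in the evaluation config.
SUPPORTED_TASKS = ("custom_jsonl", "fashioniq", "cirr")


def check_data_section(data: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in a `data` config section."""
    problems = []
    task = data.get("task")
    if task not in SUPPORTED_TASKS:
        problems.append(f"unknown task: {task!r}")
    if not data.get("image_root"):
        problems.append("image_root is required")
    # Assumption: only the FashionIQ mode needs a clothing category.
    if task == "fashioniq" and not data.get("dress_type"):
        problems.append("fashioniq needs dress_type")
    # Assumption: only the custom mode reads explicit metadata files.
    if task == "custom_jsonl":
        for key in ("query_metadata", "gallery_metadata"):
            if not data.get(key):
                problems.append(f"custom_jsonl needs {key}")
    return problems


fashioniq_cfg = {
    "task": "fashioniq",
    "image_root": "./data/fashion_iq_data",
    "split": "val",
    "dress_type": "dress",
}
cirr_cfg = {"task": "cirr", "image_root": "./data/CIRR", "split": "val"}
broken_cfg = {"task": "fashioniq", "image_root": "./data/fashion_iq_data"}
print(check_data_section(fashioniq_cfg))
print(check_data_section(cirr_cfg))
print(check_data_section(broken_cfg))
```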

Citation

@inproceedings{hou2025fire,
  author    = {Bohan Hou and Haoqiang Lin and Xuemeng Song and Haokun Wen and Meng Liu and Yupeng Hu and Xiangyu Zhao},
  title     = {FiRE: Enhancing MLLMs with Fine-Grained Context Learning for Complex Image Retrieval},
  booktitle = {Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages     = {803--812},
  publisher = {ACM},
  year      = {2025},
  doi       = {10.1145/3726302.3729979}
}

License

This project is released under the Apache License 2.0.
