This project investigates cross-modal alignment strategies for image captioning within a controlled encoder-decoder framework (ViT/CLIP → GPT-2). Four configurations are compared by varying the visual encoder (Vanilla ViT vs. CLIP ViT) and whether an explicit MLP projection module bridges the two modalities. All models are trained and evaluated on MS-COCO 2017 to isolate the effect of each alignment choice on caption quality and cross-domain generalization.
In encoder-decoder multimodal architectures, how does the choice of visual encoder (Vanilla ViT vs. CLIP ViT) and projection module (no mapper vs. MLP mapper) affect caption quality and cross-domain generalization?
| Experiment | Encoder | Mapper | Description |
|---|---|---|---|
vit_no_mapper |
Vanilla ViT | None | Baseline: implicit alignment via cross-attention |
vit_mlp_mapper |
Vanilla ViT | MLP | Explicit projection from visual to language space |
clip_no_mapper |
CLIP ViT | None | Pre-aligned visual features, no explicit bridge |
clip_mlp_mapper |
CLIP ViT | MLP | Pre-aligned features + explicit projection |
All models are trained on a 30k subset of MS-COCO train2017 and evaluated on COCO val2017 (5k images).
- BLEU-4 — 4-gram precision between generated and reference captions
- METEOR — Harmonic mean of precision and recall, accounts for synonyms
- CIDEr — Consensus-based similarity, weighted by term relevance
This project uses MS-COCO 2017. The datasets library can download it automatically when you run the notebook.
Automatic (recommended): Handled via HuggingFace datasets:
from datasets import load_dataset
ds = load_dataset("HuggingFaceM4/COCO", trust_remote_code=True)Manual download:
- Visit https://cocodataset.org/#download
- Download:
2017 Train images(~18 GB)2017 Val images(~1 GB)2017 Train/Val annotations
- Extract into a local directory and update the data path in
src/image_captioning/config.py
pip install -r requirements.txtOr manually:
pip install torch>=2.0.0 torchvision>=0.15.0
pip install transformers==4.44.0 accelerate datasets peft
pip install pycocoevalcap pycocotools
pip install pandas numpy matplotlib seaborn tqdm pillow scikit-learn nltkOpen and run all cells in image_captioning.ipynb. The notebook:
- Downloads and preprocesses the COCO dataset
- Trains all four model configurations sequentially
- Evaluates with BLEU-4, METEOR, and CIDEr
- Generates visualizations and comparison tables
jupyter notebook image_captioning.ipynbTo train and evaluate the CLIP + MLP mapper configuration independently:
python run_model3.pyImage_Caption_Generator/
├── src/
│ └── image_captioning/
│ ├── config.py # Hyperparameters and experiment configuration
│ ├── data.py # COCO data loading and Dataset class
│ ├── modeling.py # Model definitions (ViT/CLIP × No Mapper/MLP)
│ ├── training.py # Training loop
│ ├── evaluation.py # BLEU-4, METEOR, CIDEr evaluation
│ ├── generation.py # Caption generation (beam search)
│ └── visualization.py # Plots, attention heatmaps, result tables
├── image_captioning.ipynb # Main notebook: runs all experiments and figures
├── run_model3.py # Standalone script for CLIP + MLP mapper
└── requirements.txt
| Component | Model | Parameters | Notes |
|---|---|---|---|
| Encoder (baseline) | google/vit-base-patch16-224-in21k |
86M | ImageNet-21k pretrained |
| Encoder (CLIP) | openai/clip-vit-base-patch16 |
86M | CLIP-aligned visual features |
| Decoder | gpt2 |
124M | Cross-attention weights trained from scratch |
| MLP Mapper | Linear(768→1024) → ReLU → Dropout(0.1) → Linear(1024→768) → LayerNorm | — | Optional visual-to-language projection |
Encoders are frozen during training; only the decoder cross-attention and optional mapper are updated.
| Setting | Value |
|---|---|
| Dataset | MS-COCO 2017 |
| Train split | 30k subset of train2017 |
| Val split | val2017 (5k images) |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Epochs | 5 |
| Batch size | 32 |
| Mixed precision | FP16 |
| Encoder | Frozen |
- Dosovitskiy et al. (2021). An Image is Worth 16x16 Words. ICLR. (ViT)
- Radford et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML. (CLIP)
- Radford et al. (2019). Language Models are Unsupervised Multitask Learners. (GPT-2)
- Mokady et al. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv:2111.09734.
- Li et al. (2023). BLIP-2. ICML. arXiv:2301.12597.
- Lin, T.-Y. et al. (2014). Microsoft COCO. ECCV. (COCO dataset)