This project implements an image captioning system using a Convolutional Neural Network (CNN) as an encoder and a Recurrent Neural Network (LSTM) as a decoder. The system generates captions for images by encoding the visual features using a CNN and decoding them into text using an LSTM.
pip install -r requirements.txt

Download the pre-trained encoder and decoder model weights and place them in the models directory.
Download the COCO dataset and place the images in the cocoapi/images/val2017/ directory. Ensure the image paths are correctly referenced in the code.
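As a quick sanity check before running inference, the sketch below verifies that the files named in the setup steps are in place. The helper name `find_missing` is hypothetical, and the paths are the ones listed in this README; adjust them if your layout differs.

```python
from pathlib import Path

# Paths taken from this README; change them if your checkout is laid out differently.
REQUIRED_FILES = [
    "data/vocab.pkl",
    "models/encoder-3.pkl",
    "models/decoder-3.pkl",
]
IMAGE_DIR = "cocoapi/images/val2017"


def find_missing(root="."):
    """Return a list of human-readable problems found under `root` (hypothetical helper)."""
    root = Path(root)
    # Report every expected file that is not present.
    problems = [f"missing file: {rel}" for rel in REQUIRED_FILES if not (root / rel).is_file()]
    # The image directory must exist and contain at least one .jpg file.
    image_dir = root / IMAGE_DIR
    if not image_dir.is_dir():
        problems.append(f"missing directory: {IMAGE_DIR}")
    elif not any(image_dir.glob("*.jpg")):
        problems.append(f"no .jpg images found in {IMAGE_DIR}")
    return problems
```

Calling `find_missing()` from the project root returns an empty list when everything is in place; otherwise each entry names a file or directory to fix.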
This project uses the COCO dataset for image captioning. The dataset contains a wide variety of images with detailed annotations. The val2017 subset is used for testing the model.
yourprojectname/
│
├── data/
│   └── vocab.pkl            # Vocabulary file
├── models/
│   ├── encoder-3.pkl        # Pre-trained encoder model
│   └── decoder-3.pkl        # Pre-trained decoder model
├── cocoapi/
│   └── images/
│       └── val2017/         # Directory containing validation images
├── inference.ipynb          # Jupyter notebook for running inference
└── README.md                # Project README file

The encoder is a pre-trained CNN (e.g., ResNet) that extracts visual features from the input image. These features are then passed to the decoder.
The decoder is an RNN (specifically, an LSTM) that takes the encoded image features as input and generates a caption. The LSTM uses a vocabulary file to convert model outputs into human-readable text.
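To illustrate that last step, here is a minimal sketch of turning decoder output indices into a sentence. It assumes the vocabulary exposes an index-to-word mapping and uses `<start>`, `<end>`, and `<unk>` marker tokens; the actual structure stored in vocab.pkl may differ, and both `indices_to_caption` and the toy mapping are hypothetical.

```python
def indices_to_caption(indices, idx2word, start_token="<start>", end_token="<end>"):
    """Convert predicted word indices to a sentence (hypothetical helper)."""
    words = []
    for idx in indices:
        word = idx2word.get(idx, "<unk>")  # fall back for indices not in the vocabulary
        if word == start_token:
            continue  # skip the start marker
        if word == end_token:
            break  # stop at the end marker and ignore anything after it
        words.append(word)
    return " ".join(words)


# Toy vocabulary for illustration only; the real mapping comes from data/vocab.pkl.
toy_idx2word = {0: "<start>", 1: "<end>", 2: "<unk>", 3: "a", 4: "dog", 5: "runs"}
caption = indices_to_caption([0, 3, 4, 5, 1, 3], toy_idx2word)
print(caption)  # a dog runs
```

Stopping at the end marker matters because the decoder typically emits a fixed maximum number of tokens, and anything after the end marker is not part of the caption.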
The training script trains the encoder-decoder model on the COCO dataset. Due to the computational resources required, training is typically performed on a GPU-enabled machine.
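The training hyperparameters can be exposed as command-line flags with `argparse`. The sketch below is an assumption about how `train.py` might parse its flags, not the script's actual code: the flag names come from the example training command in this README, the defaults simply mirror its values, and the help strings are a plausible reading of each flag.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example training command (assumed layout)."""
    parser = argparse.ArgumentParser(description="Train the CNN encoder and LSTM decoder")
    parser.add_argument("--epochs", type=int, default=10, help="number of passes over the training set")
    parser.add_argument("--batch-size", type=int, default=32, help="samples per training batch")
    parser.add_argument("--learning-rate", type=float, default=0.001, help="optimizer step size")
    parser.add_argument("--embed-size", type=int, default=256, help="embedding dimension")
    parser.add_argument("--hidden-size", type=int, default=512, help="LSTM hidden state dimension")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--epochs", "5", "--batch-size", "64"])
print(args.epochs, args.batch_size, args.learning_rate)  # 5 64 0.001
```

Note that `argparse` converts dashes in flag names to underscores, so `--batch-size` is read as `args.batch_size`.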
To train the model, use the following command (this is an example; modify according to your needs):
python train.py --epochs 10 --batch-size 32 --learning-rate 0.001 --embed-size 256 --hidden-size 512

You can generate captions for images using the pre-trained model. The inference.ipynb notebook provides an example of how to do this.
- Open the inference.ipynb notebook.
- Follow the instructions in the notebook to load an image, run the model, and generate a caption.
from PIL import Image

# Load the image
test_image_path = "cocoapi/images/val2017/000000000785.jpg"
image = Image.open(test_image_path)

# Generate a caption (generate_caption is assumed to be defined earlier in the notebook)
caption = generate_caption(image)
print(f"Generated Caption: {caption}")

The model generates captions like "A group of people sitting at a table with a laptop." These captions are generated based on the visual features extracted by the encoder.
Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or fixes.
This project is licensed under the MIT License - see the LICENSE file for details.