Image Captioning with an Encoder-Decoder (LSTM) Model

This project implements an image captioning system that uses a Convolutional Neural Network (CNN) as the encoder and a Long Short-Term Memory (LSTM) recurrent network as the decoder. The CNN encodes the visual features of an image, and the LSTM decodes those features into a text caption.

Sample output generated

Install the required dependencies:

pip install -r requirements.txt

Download the Pre-trained Models:

Download the pre-trained encoder and decoder model weights (encoder-3.pkl and decoder-3.pkl in the Directory Structure section) and place them in the models/ directory.

Download the Dataset:

Download the COCO dataset and place the images in the cocoapi/images/val2017/ directory. Ensure the image paths are correctly referenced in the code.

Dataset

This project uses the COCO dataset for image captioning. The dataset contains a wide variety of images with detailed annotations. The val2017 subset is used for testing the model.
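
For orientation, COCO distributes its caption annotations as JSON (captions_val2017.json for the val2017 split), with an "images" list and an "annotations" list. The sketch below groups reference captions by image file name. The function name, the annotation path in the comment, and the hand-written sample are illustrative assumptions, not code or data from this repository.

```python
import json
from collections import defaultdict


def captions_by_file(coco: dict) -> dict:
    """Map each image file name to its list of reference captions."""
    # Each entry in "images" carries an integer id and a file_name.
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    # Each entry in "annotations" links one caption to an image_id.
    for ann in coco["annotations"]:
        grouped[id_to_file[ann["image_id"]]].append(ann["caption"].strip())
    return dict(grouped)


# Hand-written sample in the COCO captions layout (placeholder text, not real data).
sample = {
    "images": [{"id": 785, "file_name": "000000000785.jpg"}],
    "annotations": [
        {"id": 1, "image_id": 785, "caption": "Placeholder caption one. "},
        {"id": 2, "image_id": 785, "caption": "Placeholder caption two."},
    ],
}

# To read the real file instead (this path is an assumption):
# with open("cocoapi/annotations/captions_val2017.json") as f:
#     sample = json.load(f)

print(captions_by_file(sample))
```

Grouping by file name makes it easy to compare a generated caption against the reference captions for the same image.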

Directory Structure

yourprojectname/
│
├── data/
│   └── vocab.pkl              # Vocabulary file
├── models/
│   ├── encoder-3.pkl          # Pre-trained encoder model
│   └── decoder-3.pkl          # Pre-trained decoder model
├── cocoapi/
│   └── images/
│       └── val2017/           # Directory containing validation images
├── inference.ipynb            # Jupyter notebook for running inference
└── README.md                  # Project README file
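
Before running inference, it can help to confirm that the files above are where the code expects them. The following is a minimal sketch that checks the layout shown in the tree; the helper name missing_paths is hypothetical and is not part of this repository.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the project root.
EXPECTED_PATHS = [
    "data/vocab.pkl",
    "models/encoder-3.pkl",
    "models/decoder-3.pkl",
    "cocoapi/images/val2017",
    "inference.ipynb",
]


def missing_paths(root: str = ".") -> list:
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files and directories are present.")
```

Run it from the project root. Any path it reports as missing corresponds to one of the download steps above.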

Model Architecture

Encoder

The encoder is a pre-trained CNN (e.g., ResNet) that extracts visual features from the input image. These features are then passed to the decoder.

Decoder

The decoder is an LSTM (a type of RNN) that takes the encoded image features as input and generates a caption as a sequence of vocabulary indices. The vocabulary file (data/vocab.pkl) maps those indices back to words to produce human-readable text.
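
The last step, turning decoder output into text, can be sketched in plain Python. This assumes the vocabulary exposes an index-to-word mapping and uses special tokens named <start>, <end>, and <unk>; the actual structure of vocab.pkl is not described here, so treat these names and the toy vocabulary as hypothetical.

```python
def ids_to_caption(token_ids, idx2word, start="<start>", end="<end>"):
    """Turn predicted token ids into a sentence, dropping special tokens."""
    words = []
    for idx in token_ids:
        # Unknown ids fall back to the <unk> token.
        word = idx2word.get(idx, "<unk>")
        if word == start:
            continue  # the start marker is not part of the caption
        if word == end:
            break     # ignore everything after the end marker
        words.append(word)
    return " ".join(words)


# Toy vocabulary for illustration only.
idx2word = {0: "<start>", 1: "<end>", 2: "<unk>", 3: "a", 4: "dog", 5: "runs"}
print(ids_to_caption([0, 3, 4, 5, 1, 4], idx2word))  # a dog runs
```

Stopping at the end token matters because a decoder run for a fixed number of steps keeps emitting indices after the sentence is complete.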

Training

The training script trains the encoder-decoder model on the COCO dataset. Due to the computational resources required, training is typically performed on a GPU-enabled machine.

Training Command

To train the model, use the following command (this is an example; modify according to your needs):

python train.py --epochs 10 --batch-size 32 --learning-rate 0.001 --embed-size 256 --hidden-size 512
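
The flags in that command could be parsed as in the sketch below. The training script itself is not shown in this README, so this is only an illustration of one possible interface: the defaults are copied from the example command, and the help texts reflect the usual meaning of these hyperparameters rather than confirmed behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags used in the example training command."""
    parser = argparse.ArgumentParser(description="Train the CNN-LSTM captioning model")
    parser.add_argument("--epochs", type=int, default=10,
                        help="number of passes over the training set")
    parser.add_argument("--batch-size", type=int, default=32,
                        help="image-caption pairs per optimization step")
    parser.add_argument("--learning-rate", type=float, default=0.001,
                        help="optimizer step size")
    parser.add_argument("--embed-size", type=int, default=256,
                        help="size of the image and word embeddings (assumed meaning)")
    parser.add_argument("--hidden-size", type=int, default=512,
                        help="size of the LSTM hidden state (assumed meaning)")
    return parser


# argparse converts dashes to underscores in attribute names.
args = build_parser().parse_args(["--epochs", "5", "--batch-size", "64"])
print(args.epochs, args.batch_size, args.learning_rate)  # 5 64 0.001
```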

Inference

You can generate captions for images using the pre-trained model. The inference.ipynb notebook provides an example of how to do this.

Running Inference

Open the inference.ipynb notebook.
Follow the instructions in the notebook to load an image, run the model, and generate a caption.

Example Usage

from PIL import Image

# Load the image
test_image_path = "cocoapi/images/val2017/000000000785.jpg"
image = Image.open(test_image_path)

# Generate a caption (generate_caption is assumed to be defined in inference.ipynb)
caption = generate_caption(image)
print(f"Generated Caption: {caption}")

Results

The model generates captions like "A group of people sitting at a table with a laptop." These captions are generated based on the visual features extracted by the encoder.

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or fixes.

License

This project is licensed under the MIT License - see the LICENSE file for details.
