Multi-Label Text Classifier

Overview

This project is a multi-label text classification pipeline: each input text can be assigned several categories at once. It covers the full workflow end to end: data ingestion from GitHub, preprocessing with multi-hot encoded labels, DistilBERT training on Google Colab, experiment tracking with MLflow, and API serving with FastAPI and Docker.

Architecture Diagram

[ Raw Data Source ]
        │
        ▼
   ( Ingestion ) src/ingest.py
        │
        ▼
[ data/raw/arxiv_raw.csv ]
        │
        ▼
 ( Preprocessing ) src/preprocess.py
        │
        ├──► [ data/processed/label_encoder.pkl ]
        └──► [ data/processed/{train,val,test}.csv ]
                   │
                   ▼
      ( Colab Training ) notebooks/train_colab.ipynb  ──► [ MLflow Tracking ]
                   │
                   ▼
   [ models/distilbert_multilabel/ ]
                   │
                   ▼
         ( FastAPI Serving ) src/serve/main.py
                   │
                   ▼
           [ Docker Container ]

Tech Stack

Component	Technology
Deep Learning / NLP	PyTorch, Transformers
Data Processing	Pandas, Scikit-learn, iterstrat
Experiment Tracking	MLflow
Model Serving	FastAPI, Uvicorn, Pydantic
Evaluation	Evaluate, Scikit-learn
Containerization	Docker, Docker Compose

Project Structure

multi-label-text-clf/
├── .env.example
├── .gitignore
├── CONTRIBUTING.md
├── docker-compose.yml
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── requirements.txt
├── data/                        # gitignored — generated by pipeline
│   ├── processed/
│   └── raw/
├── models/                      # gitignored — downloaded after training
│   └── distilbert_multilabel/
├── mlruns/                      # gitignored — generated by MLflow
├── notebooks/
│   └── train_colab.ipynb
├── sample_outputs/
│   └── predictions_sample.json
├── scripts/
│   └── test_api.py
└── src/
    ├── __init__.py
    ├── config.py
    ├── ingest.py
    ├── preprocess.py
    └── serve/
        ├── __init__.py
        ├── main.py
        └── model.py

Setup

Clone the repository

   git clone https://github.com/anantha037/multi-label-text-clf.git
   cd multi-label-text-clf

Create a virtual environment

   python -m venv venv
   # On Windows:
   .\venv\Scripts\activate
   # On Unix/macOS:
   source venv/bin/activate

Install dependencies

   pip install -r requirements.txt

Running the Pipeline

Ingest Data: Download the dataset and extract the top labels:

   make ingest

Preprocess Data: Clean the data, generate multi-hot encoded labels, and split the dataset:

   make preprocess

Train the Model (Google Colab): Upload data/processed/*.csv to Google Colab and run notebooks/train_colab.ipynb.

Download Model Artifacts: Once training completes, download the trained model directory from Google Drive into models/distilbert_multilabel/.

Start the API

   make serve

Test the API

   make test-api
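
The multi-hot encoding mentioned in the Preprocess Data step can be pictured with a small plain-Python sketch. It is only an illustration of the idea, not the code in src/preprocess.py (which may well rely on scikit-learn instead); the function names `build_label_index` and `multi_hot` and the example rows are hypothetical.

```python
from typing import Dict, List, Sequence


def build_label_index(label_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each distinct label to a column index, in sorted order."""
    distinct = sorted({label for row in label_lists for label in row})
    return {label: i for i, label in enumerate(distinct)}


def multi_hot(labels: Sequence[str], index: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 in the column of every label present."""
    vector = [0] * len(index)
    for label in labels:
        vector[index[label]] = 1
    return vector


# Hypothetical rows; the label names are borrowed from the sample predictions.
rows = [["gratitude"], ["annoyance", "admiration"], []]
index = build_label_index(rows)
encoded = [multi_hot(row, index) for row in rows]

print(index)    # {'admiration': 0, 'annoyance': 1, 'gratitude': 2}
print(encoded)  # [[0, 0, 1], [1, 1, 0], [0, 0, 0]]
```

A text with no labels maps to an all-zero vector, which is what distinguishes multi-label encoding from one-hot encoding, where exactly one column is set.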

Docker

Build the image:

   docker build -t multilabel-clf .

Run the container:

   docker-compose up

MLflow UI

   make mlflow-ui
   # or
   mlflow ui --backend-store-uri mlruns

Open http://localhost:5000 to view experiment runs, metrics, and model artifacts.

API Reference

Method	Path	Description	Example Response
`GET`	`/health`	Health check and model load status	`{"status": "ok", "model_loaded": true}`
`GET`	`/labels`	Returns all supported label names	`["admiration", "amusement", ...]`
`POST`	`/predict`	Predicts labels for input text	`{"text": "...", "labels": ["gratitude"], "scores": {"gratitude": 0.98, ...}}`
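
As a rough illustration of calling `/predict`, the sketch below builds and sends the request with the Python standard library. Two things are assumptions rather than facts from this README: the base URL `http://localhost:8000`, and the request body shape `{"text": ...}`, which is inferred from the `text` field in the example response. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the host and port are not stated in this README.
BASE_URL = "http://localhost:8000"


def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /predict request with an assumed {"text": ...} JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running (for example via `make serve`), `predict("Thank you so much, this made my day")` should return a dictionary with the `text`, `labels`, and `scores` keys shown in the table above, provided the assumed URL and body shape match the service.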

Sample Predictions

Real outputs from the trained model; see sample_outputs/predictions_sample.json.

Input Text	Predicted Labels	Top Score
"I am so grateful and happy today"	gratitude	0.983
"This is absolutely infuriating and unfair"	annoyance	0.676
"I find this topic really fascinating"	admiration	0.656
"Thank you so much, this made my day"	gratitude	0.991
"I'm not sure how I feel about this"	—	neutral: 0.404
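
The last row, where the top score (neutral, 0.404) is reported but no label is predicted, appears consistent with a per-label score cutoff. The sketch below shows that decision rule under the assumption of a fixed threshold of 0.5; the actual cutoff used by the service is not stated here. Only the scores 0.983 and 0.404 come from the table, and the other values are made up for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: the cutoff used by the service is not stated in this README.
THRESHOLD = 0.5


def select_labels(scores: Dict[str, float], threshold: float = THRESHOLD) -> List[str]:
    """Keep every label whose score reaches the threshold, highest score first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept]


def top_score(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the single highest-scoring (label, score) pair."""
    return max(scores.items(), key=lambda pair: pair[1])


# Illustrative score dictionaries; "joy" and "confusion" values are hypothetical.
confident = {"gratitude": 0.983, "joy": 0.41}
uncertain = {"neutral": 0.404, "confusion": 0.31}

print(select_labels(confident))  # ['gratitude']
print(select_labels(uncertain))  # []
print(top_score(uncertain))      # ('neutral', 0.404)
```

Because each label is thresholded independently, a text can receive several labels or none, unlike a softmax classifier that always picks exactly one.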

Results

Metric	Value
Test F1 Micro	0.6804
Test F1 Macro	0.6375
Hamming Loss	0.0648

The model was trained for 4 epochs on a Google Colab T4 GPU. The best checkpoint was saved at epoch 3 (Val F1 Micro: 0.6846).
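
For readers unfamiliar with the three metrics, the plain-Python sketch below computes them on a toy multi-hot matrix. It is not the project's evaluation code (the tech stack lists Evaluate and Scikit-learn for that), the toy data is made up, and treating F1 as 0.0 for a label with no true or predicted positives is an assumption of this sketch.

```python
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[int]]


def _counts(y_true: Matrix, y_pred: Matrix, col: int) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 0 and p[col] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 0)
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty labels


def f1_micro(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    totals = [_counts(y_true, y_pred, c) for c in range(len(y_true[0]))]
    return _f1(sum(t[0] for t in totals), sum(t[1] for t in totals), sum(t[2] for t in totals))


def f1_macro(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, c)) for c in range(n_labels)) / n_labels


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label cells that are predicted incorrectly."""
    wrong = sum(tv != pv for t, p in zip(y_true, y_pred) for tv, pv in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


# Toy data: 4 texts, 3 labels (made up for illustration).
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(f1_micro(y_true, y_pred), f1_macro(y_true, y_pred), hamming_loss(y_true, y_pred))
```

Micro F1 weights every label decision equally, so frequent labels dominate it, while macro F1 weights every label equally. A macro score below the micro score, as in the table above, usually indicates weaker performance on rarer labels.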

License

MIT
