This project is a production-grade multi-label text classification pipeline: for each input text it predicts every category that applies, not just the single best match. It covers the workflow end to end, from data ingestion from GitHub and preprocessing with multi-hot label encoding, through DistilBERT training on Google Colab with MLflow experiment tracking, to API serving with FastAPI and Docker.

    [ Raw Data Source ]
            │
            ▼
    ( Ingestion ) src/ingest.py
            │
            ▼
    [ data/raw/arxiv_raw.csv ]
            │
            ▼
    ( Preprocessing ) src/preprocess.py
            │
            ├──► [ data/processed/label_encoder.pkl ]
            └──► [ data/processed/{train,val,test}.csv ]
                          │
                          ▼
    ( Colab Training ) notebooks/train_colab.ipynb ──► [ MLflow Tracking ]
            │
            ▼
    [ models/distilbert_multilabel/ ]
            │
            ▼
    ( FastAPI Serving ) src/serve/main.py
            │
            ▼
    [ Docker Container ]

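
As a rough illustration of the multi-hot encoding produced by the preprocessing stage, the sketch below maps each example's label list onto a fixed-length 0/1 vector. It is a standard-library approximation, not the code in `src/preprocess.py`; the function names and the sample label lists are hypothetical. scikit-learn's `MultiLabelBinarizer` performs an equivalent transformation.

```python
from typing import Dict, List, Sequence


def build_label_index(label_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign each distinct label a column index, in sorted order."""
    distinct = sorted({label for row in label_lists for label in row})
    return {label: column for column, label in enumerate(distinct)}


def multi_hot(labels: Sequence[str], index: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 in the column of every label present."""
    vector = [0] * len(index)
    for label in labels:
        vector[index[label]] = 1
    return vector


# Hypothetical examples: a text may carry several labels, or none at all.
examples = [["gratitude", "admiration"], ["amusement"], []]

label_index = build_label_index(examples)
encoded = [multi_hot(row, label_index) for row in examples]
```

Persisting the label-to-column mapping alongside the splits matters because the serving code has to translate output columns back into label names in exactly the same order.
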
| Component | Technology |
|---|---|
| Deep Learning / NLP | PyTorch, Transformers |
| Data Processing | Pandas, Scikit-learn, iterstrat |
| Experiment Tracking | MLflow |
| Model Serving | FastAPI, Uvicorn, Pydantic |
| Evaluation | Evaluate, Scikit-learn |
| Containerization | Docker, Docker Compose |

    multi-label-text-clf/
    ├── .env.example
    ├── .gitignore
    ├── CONTRIBUTING.md
    ├── docker-compose.yml
    ├── Dockerfile
    ├── LICENSE
    ├── Makefile
    ├── README.md
    ├── requirements.txt
    ├── data/                       # gitignored; generated by pipeline
    │   ├── processed/
    │   └── raw/
    ├── models/                     # gitignored; downloaded after training
    │   └── distilbert_multilabel/
    ├── mlruns/                     # gitignored; generated by MLflow
    ├── notebooks/
    │   └── train_colab.ipynb
    ├── sample_outputs/
    │   └── predictions_sample.json
    ├── scripts/
    │   └── test_api.py
    └── src/
        ├── __init__.py
        ├── config.py
        ├── ingest.py
        ├── preprocess.py
        └── serve/
            ├── __init__.py
            ├── main.py
            └── model.py


- Clone the repository

      git clone https://github.com/anantha037/multi-label-text-clf.git
      cd multi-label-text-clf

- Create a virtual environment

      python -m venv venv
      # On Windows:
      .\venv\Scripts\activate
      # On Unix/macOS:
      source venv/bin/activate

- Install dependencies

      pip install -r requirements.txt

- Ingest Data: download the dataset and extract the top labels.

      make ingest

- Preprocess Data: clean the data, generate multi-hot encoded labels, and split the dataset.

      make preprocess

- Train the Model (Google Colab): upload `data/processed/*.csv` to Google Colab and run `notebooks/train_colab.ipynb`.
- Download Model Artifacts: once training completes, download the trained model directory from Google Drive into `models/distilbert_multilabel/`.
- Start the API

      make serve

- Test the API

      make test-api

To run with Docker:

- Build the image:

      docker build -t multilabel-clf .

- Run the container:

      docker-compose up

To open the MLflow UI:

    make mlflow-ui
    # or
    mlflow ui --backend-store-uri mlruns

Open http://localhost:5000 to view experiment runs, metrics, and model artifacts.

| Method | Path | Description | Example Response |
|---|---|---|---|
| `GET` | `/health` | Health check and model load status | `{"status": "ok", "model_loaded": true}` |
| `GET` | `/labels` | Returns all supported label names | `["admiration", "amusement", ...]` |
| `POST` | `/predict` | Predicts labels for input text | `{"text": "...", "labels": ["gratitude"], "scores": {"gratitude": 0.98, ...}}` |

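
A minimal client sketch for the `/predict` endpoint, using only the standard library. Two things here are assumptions rather than facts from this README: the base URL (`http://localhost:8000`, Uvicorn's default port) and the request body shape (`{"text": ...}`). The helper names are hypothetical; `scripts/test_api.py` may do this differently.

```python
import json
from typing import Optional, Tuple
from urllib import request

BASE_URL = "http://localhost:8000"  # assumption: Uvicorn's default port


def build_predict_request(text: str, base_url: str = BASE_URL) -> request.Request:
    """Build a POST /predict request; the {"text": ...} body shape is assumed."""
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def top_label(response: dict) -> Optional[Tuple[str, float]]:
    """Return the (label, score) pair with the highest score, or None if empty."""
    scores = response.get("scores", {})
    if not scores:
        return None
    return max(scores.items(), key=lambda item: item[1])


def predict(text: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON reply (requires a running server)."""
    req = build_predict_request(text, base_url)
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the API running, `predict("Thank you so much, this made my day")` should return a dictionary shaped like the example response in the table, and `top_label` picks out its highest-scoring entry.
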

Real outputs from the trained model (see `sample_outputs/predictions_sample.json`):

| Input Text | Predicted Labels | Top Score |
|---|---|---|
| "I am so grateful and happy today" | gratitude | 0.983 |
| "This is absolutely infuriating and unfair" | annoyance | 0.676 |
| "I find this topic really fascinating" | admiration | 0.656 |
| "Thank you so much, this made my day" | gratitude | 0.991 |
| "I'm not sure how I feel about this" | — | neutral: 0.404 |
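
The last row returns no label even though `neutral` has the highest score. That is what a per-label cutoff would produce: in multi-label classification each logit is usually passed through a sigmoid independently, and a label is kept only if its probability clears a threshold. The sketch below assumes a threshold of 0.5 and uses hypothetical logits; the cutoff actually used by `src/serve/model.py` is not stated in this README.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode(
    logits: Dict[str, float], threshold: float = 0.5
) -> Tuple[List[str], Dict[str, float]]:
    """Score every label independently and keep those at or above the threshold."""
    scores = {label: sigmoid(value) for label, value in logits.items()}
    labels = [label for label, score in scores.items() if score >= threshold]
    return labels, scores


# Hypothetical logits: two labels clear the assumed 0.5 threshold, one does not.
confident_labels, _ = decode({"gratitude": 4.0, "joy": 0.3, "anger": -5.0})

# Hypothetical logits: the best score is about 0.40, so no label is returned.
uncertain_labels, uncertain_scores = decode({"neutral": -0.39, "confusion": -1.2})
```

Because every label is scored on its own, the result can contain several labels, exactly one, or none, which is the behavior that separates multi-label from multi-class prediction.
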

| Metric | Value |
|---|---|
| Test F1 Micro | 0.6804 |
| Test F1 Macro | 0.6375 |
| Hamming Loss | 0.0648 |

Trained for 4 epochs on a Google Colab T4 GPU. The best checkpoint was saved at epoch 3 (validation F1 micro: 0.6846).
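
For reference, the three reported metrics can be computed from multi-hot target and prediction matrices as sketched below. This is a standard-library illustration on made-up data that follows the usual definitions (the ones implemented by scikit-learn's `f1_score` and `hamming_loss`); it is not necessarily the evaluation code used in the notebook.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def label_counts(y_true: Matrix, y_pred: Matrix, col: int) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives for one label."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        t, p = true_row[col], pred_row[col]
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_micro(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    per_label = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp, fp, fn = (sum(values) for values in zip(*per_label))
    return f1_from_counts(tp, fp, fn)


def f1_macro(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    per_label = [
        f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / (len(y_true) * len(y_true[0]))


# Made-up data: 4 examples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

micro = f1_micro(y_true, y_pred)
macro = f1_macro(y_true, y_pred)
hamming = hamming_loss(y_true, y_pred)
```

Micro F1 is dominated by frequent labels, while macro F1 weights every label equally, so a gap between the two (as in the table above) usually points to weaker performance on rarer labels.
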

License: MIT (see `LICENSE`).