anantha037/multi-label-text-clf

Multi-Label Text Classifier

Python PyTorch HuggingFace MLflow FastAPI Docker CI

Overview

This project is a production-grade multi-label text classification pipeline: for a given text, it predicts every category that applies rather than a single class. It covers the workflow end to end: data ingestion from GitHub, preprocessing with multi-hot label encoding, DistilBERT training on Google Colab, experiment tracking with MLflow, and API serving with FastAPI and Docker.

Architecture Diagram

[ Raw Data Source ]
        │
        ▼
   ( Ingestion ) src/ingest.py
        │
        ▼
[ data/raw/arxiv_raw.csv ]
        │
        ▼
 ( Preprocessing ) src/preprocess.py
        │
        ├──► [ data/processed/label_encoder.pkl ]
        └──► [ data/processed/{train,val,test}.csv ]
                   │
                   ▼
      ( Colab Training ) notebooks/train_colab.ipynb  ──► [ MLflow Tracking ]
                   │
                   ▼
   [ models/distilbert_multilabel/ ]
                   │
                   ▼
         ( FastAPI Serving ) src/serve/main.py
                   │
                   ▼
           [ Docker Container ]

Tech Stack

Component | Technology
Deep Learning / NLP | PyTorch, Transformers
Data Processing | Pandas, Scikit-learn, iterstrat
Experiment Tracking | MLflow
Model Serving | FastAPI, Uvicorn, Pydantic
Evaluation | Evaluate, Scikit-learn
Containerization | Docker, Docker Compose

Project Structure

multi-label-text-clf/
├── .env.example
├── .gitignore
├── CONTRIBUTING.md
├── docker-compose.yml
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── requirements.txt
├── data/                        # gitignored — generated by pipeline
│   ├── processed/
│   └── raw/
├── models/                      # gitignored — downloaded after training
│   └── distilbert_multilabel/
├── mlruns/                      # gitignored — generated by MLflow
├── notebooks/
│   └── train_colab.ipynb
├── sample_outputs/
│   └── predictions_sample.json
├── scripts/
│   └── test_api.py
└── src/
    ├── __init__.py
    ├── config.py
    ├── ingest.py
    ├── preprocess.py
    └── serve/
        ├── __init__.py
        ├── main.py
        └── model.py

Setup

  1. Clone the repository
   git clone https://github.com/anantha037/multi-label-text-clf.git
   cd multi-label-text-clf
  2. Create a virtual environment
   python -m venv venv
   # On Windows:
   .\venv\Scripts\activate
   # On Unix/macOS:
   source venv/bin/activate
  3. Install dependencies
   pip install -r requirements.txt

Running the Pipeline

  1. Ingest Data. Download the dataset and extract the top labels:
   make ingest
  2. Preprocess Data. Clean the data, generate multi-hot encoded labels, and split the dataset:
   make preprocess
  3. Train the Model (Google Colab). Upload data/processed/*.csv to Google Colab and run notebooks/train_colab.ipynb.
  4. Download Model Artifacts. Once training completes, download the trained model directory from Google Drive into models/distilbert_multilabel/.
  5. Start the API
   make serve
  6. Test the API
   make test-api
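
The multi-hot encoding produced in the preprocessing step can be illustrated with a minimal, dependency-free sketch. This is not the code in src/preprocess.py: the helper names `fit_label_index` and `multi_hot` are hypothetical, and the toy samples only borrow label names from the API examples in this README.

```python
from typing import Iterable, List, Sequence


def fit_label_index(samples: Iterable[Sequence[str]]) -> List[str]:
    """Collect the sorted set of label names seen across all samples."""
    return sorted({label for labels in samples for label in labels})


def multi_hot(labels: Sequence[str], classes: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with a 1 at the position of each present label."""
    present = set(labels)
    return [1 if name in present else 0 for name in classes]


# Hypothetical toy data: each sample carries one or more labels.
samples = [["gratitude"], ["admiration", "gratitude"], ["neutral"]]

classes = fit_label_index(samples)
encoded = [multi_hot(labels, classes) for labels in samples]

print(classes)
print(encoded)
```

Each row has one column per known label, so a text with two labels simply has two ones in its row. The saved data/processed/label_encoder.pkl presumably plays the role of `classes` here, keeping the column order fixed between training and serving.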

Docker

  1. Build the image:
   docker build -t multilabel-clf .
  2. Run the container:
   docker-compose up

MLflow UI

make mlflow-ui
# or
mlflow ui --backend-store-uri mlruns

Open http://localhost:5000 to view experiment runs, metrics, and model artifacts.

API Reference

Method | Path | Description | Example Response
GET | /health | Health check and model load status | {"status": "ok", "model_loaded": true}
GET | /labels | Returns all supported label names | ["admiration", "amusement", ...]
POST | /predict | Predicts labels for input text | {"text": "...", "labels": ["gratitude"], "scores": {"gratitude": 0.98, ...}}
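
As a rough illustration of calling the /predict endpoint, the sketch below builds the request with the Python standard library. Two things are assumptions, not confirmed by this README: the server address (http://localhost:8000, Uvicorn's default port) and the request body shape ({"text": ...}, inferred from the "text" field echoed in the example response). The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the API listens on Uvicorn's default port.
BASE_URL = "http://localhost:8000"


def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /predict with a JSON body (assumed schema)."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_predict_request("Thank you so much, this made my day")
print(request.get_method(), request.full_url)
```

With the API started via make serve, calling `predict("Thank you so much, this made my day")` should return a dictionary with the "text", "labels", and "scores" keys shown in the table above.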

Sample Predictions

Real outputs from the trained model — see sample_outputs/predictions_sample.json.

Input Text | Predicted Labels | Top Score
"I am so grateful and happy today" | gratitude | 0.983
"This is absolutely infuriating and unfair" | annoyance | 0.676
"I find this topic really fascinating" | admiration | 0.656
"Thank you so much, this made my day" | gratitude | 0.991
"I'm not sure how I feel about this" | neutral | 0.404
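
One common way to turn per-label model outputs into predictions like those above is to apply a sigmoid to each logit independently and keep every label whose score clears a threshold. The sketch below shows that idea under stated assumptions: the 0.5 threshold and the fall-back to the single highest-scoring label are illustrative choices, not confirmed behavior of src/serve/model.py, and the logits are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode(
    logits: Sequence[float],
    classes: Sequence[str],
    threshold: float = 0.5,  # assumption: illustrative cut-off
) -> Tuple[List[str], Dict[str, float]]:
    """Score each label independently and keep those at or above the threshold."""
    scores = {name: sigmoid(z) for name, z in zip(classes, logits)}
    labels = [name for name, p in scores.items() if p >= threshold]
    if not labels:
        # Assumed fall-back: report the best label when nothing clears the threshold.
        labels = [max(scores, key=scores.get)]
    return labels, scores


classes = ["admiration", "gratitude", "neutral"]

# Made-up logits: two labels clear the threshold.
confident_labels, confident_scores = decode([0.5, 4.0, -3.0], classes)

# Made-up logits: no label clears the threshold, so the top one is returned.
unsure_labels, unsure_scores = decode([-2.0, -1.5, -0.4], classes)

print(confident_labels)
print(unsure_labels)
```

Because each label is scored on its own, the scores do not sum to one, which is what allows several labels to be returned for the same text.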

Results

Metric | Value
Test F1 Micro | 0.6804
Test F1 Macro | 0.6375
Hamming Loss | 0.0648

Trained for 4 epochs on a Google Colab T4 GPU. The best checkpoint was saved at epoch 3 (Val F1 Micro: 0.6846).
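
For readers unfamiliar with the three metrics in the table, here is a small pure-Python sketch of how they are defined on multi-hot matrices. The toy `y_true` and `y_pred` values are made up and unrelated to the reported results. The project itself lists Evaluate and Scikit-learn for evaluation, so this only illustrates the definitions.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def per_label_counts(y_true: Matrix, y_pred: Matrix) -> List[Tuple[int, int, int]]:
    """Return (tp, fp, fn) for every label column."""
    counts = []
    for col in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 0 and p[col] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 0)
        counts.append((tp, fp, fn))
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*tp / (2*tp + fp + fn), defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    counts = per_label_counts(y_true, y_pred)
    return f1(*(sum(c[i] for c in counts) for i in range(3)))


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    scores = [f1(*c) for c in per_label_counts(y_true, y_pred)]
    return sum(scores) / len(scores)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label cells that are predicted incorrectly."""
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / (len(y_true) * len(y_true[0]))


# Made-up toy data: 3 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

print(round(micro_f1(y_true, y_pred), 4))
print(round(macro_f1(y_true, y_pred), 4))
print(round(hamming_loss(y_true, y_pred), 4))
```

Micro F1 weights every label decision equally, so frequent labels dominate it, while macro F1 gives each label the same weight. A macro score below the micro score, as in the table above, usually indicates weaker performance on rarer labels.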

License

MIT
