Spam Text Classification with Streamlit and TF-IDF

This project is a simple Spam/Ham text classification web app built with Python, NLTK, scikit-learn, and Streamlit.

The application preprocesses text messages, converts them into TF-IDF features, trains a Multinomial Naive Bayes classifier, and allows users to test new input messages directly from a Streamlit interface.

Features

  • Text preprocessing with:

    • Lowercasing
    • Punctuation removal
    • Tokenization
    • Stopword removal
    • Stemming
  • Text vectorization using TF-IDF

  • Spam/Ham classification using Multinomial Naive Bayes

  • Model evaluation with:

    • Validation accuracy
    • Test accuracy
  • Simple web interface using Streamlit

Project Structure

.
├── app.py
├── 2cls_spam_text_cls.csv
└── README.md

Requirements

Make sure you have Python installed, then install the required libraries:

pip install streamlit pandas numpy nltk scikit-learn

NLTK Resources

This project uses NLTK resources for tokenization and stopword removal. The following packages are downloaded automatically in the code:

  • stopwords
  • punkt
  • punkt_tab

Dataset

The dataset file should be named:

2cls_spam_text_cls.csv
It can also be downloaded from: https://drive.google.com/file/d/1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R/view

It must contain at least these two columns:

  • Message → the text message
  • Category → the label (spam or ham)

Example:

| Message                                     | Category |
|---------------------------------------------|----------|
| Congratulations! You have won a free ticket | spam     |
| Hi, are we still meeting tomorrow?          | ham      |
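A quick sanity check that the file exposes the two required columns can be done with the standard library's csv module (the app itself loads the data with pandas). The in-memory sample below reuses the rows from the table above; whether the real file lists Category or Message first is an assumption:

```python
import csv
import io

# Tiny in-memory stand-in for open("2cls_spam_text_cls.csv").
# Column order here is an assumption about the real file.
sample = io.StringIO(
    "Category,Message\n"
    "spam,Congratulations! You have won a free ticket\n"
    '"ham","Hi, are we still meeting tomorrow?"\n'
)

rows = list(csv.DictReader(sample))
# Both required columns must be present in every row.
assert {"Message", "Category"} <= set(rows[0].keys())
print(rows[0]["Category"], "-", rows[0]["Message"])
```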

How It Works

1. Text Preprocessing

Each message goes through the following steps:

  1. Convert text to lowercase
  2. Remove punctuation
  3. Tokenize into words
  4. Remove English stopwords
  5. Apply stemming
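The five steps above can be sketched in plain Python. The project itself uses NLTK (word_tokenize, the full English stopword list, and a proper stemmer); the tiny stopword set and crude suffix-stripping "stemmer" below are simplified stand-ins so the sketch runs without any downloads:

```python
import string

# Stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "are", "we", "you", "have", "i", "to"}

def stem(word):
    # Crude suffix stripping; NLTK's stemmer handles this properly.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. remove punctuation
    tokens = text.split()                                             # 3. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]                # 4. remove stopwords
    return [stem(t) for t in tokens]                                  # 5. stem

print(preprocess("Congratulations! You have won a free ticket"))
```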

2. TF-IDF Vectorization

After preprocessing, messages are transformed into numerical vectors using TF-IDF (Term Frequency - Inverse Document Frequency).

This helps reduce the influence of very common words and gives more importance to informative words.
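On a toy corpus the idea can be computed by hand. This sketch uses the plain tf * log(N/df) formulation; note that scikit-learn's TfidfVectorizer applies a smoothed idf and L2-normalizes each row, so its exact numbers differ:

```python
import math
from collections import Counter

# Two already-preprocessed toy documents.
docs = [
    ["free", "ticket", "free"],
    ["meeting", "tomorrow"],
]

N = len(docs)
df = Counter()            # document frequency: how many docs contain each term
for doc in docs:
    df.update(set(doc))

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)   # term frequency within the document
    idf = math.log(N / df[term])      # rarer across documents => larger idf
    return tf * idf

# "free" occurs twice in doc 0 and appears in only 1 of the 2 documents.
print(round(tfidf("free", docs[0]), 4))
```

A word appearing in every document gets idf = log(1) = 0, which is exactly how TF-IDF suppresses very common words.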

3. Model Training

The dataset is split into:

  • 70% Training set
  • 20% Validation set
  • 10% Test set

A Multinomial Naive Bayes model is trained on the training set.
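The 70/20/10 proportions can be produced with two calls to scikit-learn's train_test_split (first carving off the 70% training portion, then dividing the remaining 30% into validation and test). The index-slicing sketch below shows the same proportions with only the standard library; the fixed seed is an assumption for reproducibility:

```python
import random

def split_indices(n, seed=42):
    """Return (train, val, test) index lists in a 70/20/10 ratio."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # shuffle before splitting
    n_train = int(n * 0.7)
    n_val = int(n * 0.2)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(100)
print(len(train), len(val), len(test))
```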

4. Prediction

When the user enters a new message in the Streamlit app, the text is:

  • preprocessed
  • transformed using the trained TF-IDF vectorizer
  • predicted as either spam or ham
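End to end, this flow can be mocked with a toy scorer. The `SPAM_WORDS` set and the two-hit threshold below are illustrative stand-ins for the trained TF-IDF vectorizer and Naive Bayes model; only the preprocess → transform → predict shape matches the real app:

```python
import string

# Toy stand-in for a trained model: a handful of trigger words.
SPAM_WORDS = {"free", "won", "win", "prize", "congratulations", "claim"}

def classify(message):
    # Preprocess: lowercase, strip punctuation, tokenize.
    cleaned = message.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = set(cleaned.split())
    # "Predict": flag as spam when two or more trigger words appear.
    return "spam" if len(tokens & SPAM_WORDS) >= 2 else "ham"

print(classify("Congratulations! You have won a free ticket"))
print(classify("Hi, are we still meeting tomorrow?"))
```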

Running the App

Run the following command in the terminal:

streamlit run app.py
