Spam Text Classification with Streamlit and TF-IDF

This project is a simple Spam/Ham text classification web app built with Python, NLTK, scikit-learn, and Streamlit.

The application preprocesses text messages, converts them into TF-IDF features, trains a Multinomial Naive Bayes classifier, and allows users to test new input messages directly from a Streamlit interface.

Features

  • Text preprocessing with:

    • Lowercasing
    • Punctuation removal
    • Tokenization
    • Stopword removal
    • Stemming
  • Text vectorization using TF-IDF

  • Spam/Ham classification using Multinomial Naive Bayes

  • Model evaluation with:

    • Validation accuracy
    • Test accuracy
  • Simple web interface using Streamlit

Project Structure

.
├── app.py
├── 2cls_spam_text_cls.csv
└── README.md

Requirements

Make sure you have Python installed, then install the required libraries:

pip install streamlit pandas numpy nltk scikit-learn

NLTK Resources

This project uses NLTK resources for tokenization and stopword removal. The following packages are downloaded automatically in the code:

  • stopwords
  • punkt
  • punkt_tab

Dataset

The dataset file should be named:

2cls_spam_text_cls.csv
It can also be downloaded from: https://drive.google.com/file/d/1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R/view

It must contain at least these two columns:

  • Message → the text message
  • Category → the label (spam or ham)

Example:

| Message                                     | Category |
|---------------------------------------------|----------|
| Congratulations! You have won a free ticket | spam     |
| Hi, are we still meeting tomorrow?          | ham      |
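A quick sanity check that the file exposes the two required columns can be done with the standard library's csv module (the app itself loads the data with pandas). The in-memory sample below reuses the rows from the table above; whether the real file lists Category or Message first is an assumption:

```python
import csv
import io

# Tiny in-memory stand-in for open("2cls_spam_text_cls.csv").
# Column order here is an assumption about the real file.
sample = io.StringIO(
    "Category,Message\n"
    "spam,Congratulations! You have won a free ticket\n"
    '"ham","Hi, are we still meeting tomorrow?"\n'
)

rows = list(csv.DictReader(sample))
# Both required columns must be present in every row.
assert {"Message", "Category"} <= set(rows[0].keys())
print(rows[0]["Category"], "-", rows[0]["Message"])
```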

How It Works

1. Text Preprocessing

Each message goes through the following steps:

  1. Convert text to lowercase
  2. Remove punctuation
  3. Tokenize into words
  4. Remove English stopwords
  5. Apply stemming
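The five steps above can be sketched in plain Python. The project itself uses NLTK (word_tokenize, the full English stopword list, and a proper stemmer); the tiny stopword set and crude suffix-stripping "stemmer" below are simplified stand-ins so the sketch runs without any downloads:

```python
import string

# Stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "are", "we", "you", "have", "i", "to"}

def stem(word):
    # Crude suffix stripping; NLTK's stemmer handles this properly.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. remove punctuation
    tokens = text.split()                                             # 3. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]                # 4. remove stopwords
    return [stem(t) for t in tokens]                                  # 5. stem

print(preprocess("Congratulations! You have won a free ticket"))
```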

2. TF-IDF Vectorization

After preprocessing, messages are transformed into numerical vectors using TF-IDF (Term Frequency - Inverse Document Frequency).

This helps reduce the influence of very common words and gives more importance to informative words.
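On a toy corpus the idea can be computed by hand. This sketch uses the plain tf * log(N/df) formulation; note that scikit-learn's TfidfVectorizer applies a smoothed idf and L2-normalizes each row, so its exact numbers differ:

```python
import math
from collections import Counter

# Two already-preprocessed toy documents.
docs = [
    ["free", "ticket", "free"],
    ["meeting", "tomorrow"],
]

N = len(docs)
df = Counter()            # document frequency: how many docs contain each term
for doc in docs:
    df.update(set(doc))

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)   # term frequency within the document
    idf = math.log(N / df[term])      # rarer across documents => larger idf
    return tf * idf

# "free" occurs twice in doc 0 and appears in only 1 of the 2 documents.
print(round(tfidf("free", docs[0]), 4))
```

A word appearing in every document gets idf = log(1) = 0, which is exactly how TF-IDF suppresses very common words.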

3. Model Training

The dataset is split into:

  • 70% Training set
  • 20% Validation set
  • 10% Test set

A Multinomial Naive Bayes model is trained on the training set.
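The 70/20/10 proportions can be produced with two calls to scikit-learn's train_test_split (first carving off the 70% training portion, then dividing the remaining 30% into validation and test). The index-slicing sketch below shows the same proportions with only the standard library; the fixed seed is an assumption for reproducibility:

```python
import random

def split_indices(n, seed=42):
    """Return (train, val, test) index lists in a 70/20/10 ratio."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # shuffle before splitting
    n_train = int(n * 0.7)
    n_val = int(n * 0.2)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(100)
print(len(train), len(val), len(test))
```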

4. Prediction

When the user enters a new message in the Streamlit app, the text is:

  • preprocessed
  • transformed using the trained TF-IDF vectorizer
  • predicted as either spam or ham
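End to end, this flow can be mocked with a toy scorer. The `SPAM_WORDS` set and the two-hit threshold below are illustrative stand-ins for the trained TF-IDF vectorizer and Naive Bayes model; only the preprocess → transform → predict shape matches the real app:

```python
import string

# Toy stand-in for a trained model: a handful of trigger words.
SPAM_WORDS = {"free", "won", "win", "prize", "congratulations", "claim"}

def classify(message):
    # Preprocess: lowercase, strip punctuation, tokenize.
    cleaned = message.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = set(cleaned.split())
    # "Predict": flag as spam when two or more trigger words appear.
    return "spam" if len(tokens & SPAM_WORDS) >= 2 else "ham"

print(classify("Congratulations! You have won a free ticket"))
print(classify("Hi, are we still meeting tomorrow?"))
```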

Running the App

Run the following command in the terminal:

streamlit run app.py
