This project is a simple Spam/Ham text classification web app built with Python, NLTK, scikit-learn, and Streamlit.
The application preprocesses text messages, converts them into TF-IDF features, trains a Multinomial Naive Bayes classifier, and allows users to test new input messages directly from a Streamlit interface.
- Text preprocessing with:
  - Lowercasing
  - Punctuation removal
  - Tokenization
  - Stopword removal
  - Stemming
- Text vectorization using TF-IDF
- Spam/Ham classification using Multinomial Naive Bayes
- Model evaluation with:
  - Validation accuracy
  - Test accuracy
- Simple web interface using Streamlit
.
├── app.py
├── 2cls_spam_text_cls.csv
└── README.md

Make sure you have Python installed, then install the required libraries:

`pip install streamlit pandas numpy nltk scikit-learn`

This project uses NLTK resources for tokenization and stopword removal. The following packages are downloaded automatically in the code:
- `stopwords`
- `punkt`
- `punkt_tab`
The dataset file should be named `2cls_spam_text_cls.csv`. It is also available at: https://drive.google.com/file/d/1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R/view

It must contain at least these two columns:
- `Message` → the text message
- `Category` → the label (`spam` or `ham`)
Example:
| Message | Category |
|---|---|
| Congratulations! You have won a free ticket | spam |
| Hi, are we still meeting tomorrow? | ham |
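
The column requirements above can be checked at load time. The following is a minimal sketch using pandas; the helper name `load_dataset` and the label normalization are illustrative assumptions and are not taken from `app.py`.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = {"Message", "Category"}


def load_dataset(source):
    """Read the CSV and check that the expected columns and labels are present.

    `load_dataset` is a hypothetical helper, not a function from app.py.
    """
    df = pd.read_csv(source)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    # Normalize labels so that " Spam" and "spam" are treated the same
    df["Category"] = df["Category"].str.strip().str.lower()
    unexpected = set(df["Category"].unique()) - {"spam", "ham"}
    if unexpected:
        raise ValueError(f"Unexpected labels: {unexpected}")
    return df


# Usage with an in-memory CSV; pass "2cls_spam_text_cls.csv" for the real file
sample_csv = io.StringIO(
    "Message,Category\n"
    '"Congratulations! You have won a free ticket",spam\n'
    '"Hi, are we still meeting tomorrow?",ham\n'
)
df = load_dataset(sample_csv)
print(df["Category"].value_counts().to_dict())
```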
Each message goes through the following steps:
- Convert text to lowercase
- Remove punctuation
- Tokenize into words
- Remove English stopwords
- Apply stemming
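
The five steps above could be sketched as follows. This is an illustration under stated assumptions, not the code in `app.py`: the function name `preprocess_text` is hypothetical, the Porter stemmer is assumed (the README does not say which stemmer is used), and the tokenizer defaults to whitespace splitting with a tiny hand-written stopword set so that the example runs without downloading NLTK resources.

```python
import string

from nltk.stem import PorterStemmer

# Assumption: Porter stemming; the README does not name the stemmer used
_stemmer = PorterStemmer()
_punct_table = str.maketrans("", "", string.punctuation)


def preprocess_text(text, stop_words, tokenize=str.split):
    """Hypothetical helper: lowercase, strip punctuation, tokenize,
    drop stopwords, and stem the remaining tokens."""
    text = text.lower()                      # 1. lowercase
    text = text.translate(_punct_table)      # 2. remove punctuation
    tokens = tokenize(text)                  # 3. tokenize
    tokens = [t for t in tokens if t not in stop_words]  # 4. stopwords
    return [_stemmer.stem(t) for t in tokens]            # 5. stemming


# Tiny hand-written stopword set so the sketch runs without NLTK downloads.
# With the NLTK resources installed, one would instead pass
# set(nltk.corpus.stopwords.words("english")) and tokenize=nltk.word_tokenize.
demo_stop_words = {"are", "we"}
tokens = preprocess_text("Hi, are we still meeting tomorrow?", demo_stop_words)
print(tokens)
```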
After preprocessing, messages are transformed into numerical vectors using TF-IDF (Term Frequency - Inverse Document Frequency).
This helps reduce the influence of very common words and gives more importance to informative words.
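
To illustrate how TF-IDF down-weights common words, here is a small sketch using scikit-learn's `TfidfVectorizer` with default settings on a made-up three-message corpus. The corpus and the default parameters are assumptions for illustration; the configuration used in `app.py` may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "free" appears in every message, "ticket" in only one
corpus = [
    "free ticket free prize",
    "free lunch tomorrow",
    "free meeting tomorrow",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: messages x vocabulary

# Map each term to its learned inverse document frequency
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

print(X.shape)
for term in ("free", "tomorrow", "ticket"):
    print(term, round(idf[term], 3))
```

Terms that occur in fewer messages receive a larger IDF weight, so "ticket" contributes more to a message's vector than "free" does.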
The dataset is split into:
- 70% Training set
- 20% Validation set
- 10% Test set
A Multinomial Naive Bayes model is trained on the training set.
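
One way to obtain the 70/20/10 split and train the classifier is sketched below with scikit-learn. The data is synthetic stand-in text, not the real dataset, and several details are assumptions rather than facts about `app.py`: two consecutive `train_test_split` calls, stratification, the fixed `random_state`, and fitting the vectorizer on the training set only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in data (not the real dataset): 50 spam and 50 ham messages
messages = [f"win a free prize now offer {i}" for i in range(50)]
messages += [f"see you at lunch tomorrow meeting {i}" for i in range(50)]
labels = ["spam"] * 50 + ["ham"] * 50

# Use absolute sizes so the 70/20/10 proportions are exact
n = len(messages)
n_test = round(0.10 * n)
n_val = round(0.20 * n)

# Hold out the test set first, then carve the validation set out of the rest
X_rest, X_test, y_rest, y_test = train_test_split(
    messages, labels, test_size=n_test, stratify=labels, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=42
)

# Assumption: TF-IDF is fitted on the training messages only and reused after
vectorizer = TfidfVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(X_train), y_train)

val_acc = accuracy_score(y_val, model.predict(vectorizer.transform(X_val)))
test_acc = accuracy_score(y_test, model.predict(vectorizer.transform(X_test)))
print(f"train={len(X_train)} val={len(X_val)} test={len(X_test)}")
print(f"validation accuracy={val_acc:.2f}, test accuracy={test_acc:.2f}")
```

Fitting the vectorizer on the training messages only keeps vocabulary and IDF statistics from the validation and test sets out of the model, which makes the reported accuracies a fairer estimate.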
When the user enters a new message in the Streamlit app, the text is:
- preprocessed
- transformed using the trained TF-IDF vectorizer
- predicted as either spam or ham
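
The prediction flow above might look roughly like this. The function names `simple_preprocess` and `predict_message` are hypothetical, the preprocessing is a reduced stand-in (lowercasing and punctuation removal only), and the four training messages are made up so that the sketch is self-contained.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def simple_preprocess(text):
    """Stand-in for the full preprocessing: lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def predict_message(text, vectorizer, model):
    """Hypothetical helper: preprocess, vectorize with the already fitted
    TF-IDF vectorizer, and return the predicted label."""
    features = vectorizer.transform([simple_preprocess(text)])
    return model.predict(features)[0]


# Toy training data (made up) so the example runs on its own
train_messages = [
    "win a free prize now",
    "claim your free ticket",
    "are we meeting tomorrow",
    "see you at lunch",
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(
    [simple_preprocess(m) for m in train_messages]
)
model = MultinomialNB().fit(train_features, train_labels)

prediction = predict_message("Free prize! Claim now", vectorizer, model)
print(prediction)
```

In a Streamlit app, `text` would typically come from an input widget such as `st.text_input`, and the result would be displayed with a call such as `st.write`. Note that the vectorizer is only transformed here, never refitted, so new messages are mapped into the same feature space the model was trained on.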
Run the following command in the terminal:
`streamlit run app.py`