Email Spam Classifier

1. Introduction

This project implements an email spam classifier built with scikit-learn. The goal is to distinguish legitimate (ham) emails from unsolicited (spam) emails, which improves the user experience and reduces exposure to phishing and malicious content. The classifier is a Logistic Regression model trained on TF-IDF vectors of the email messages.

2. Dataset

The dataset utilized for this project is mail_data.csv, which comprises a collection of email messages labeled as either 'spam' or 'ham'.

Dataset Characteristics:

Total Entries: 5572
Features:
- Category: The label indicating whether an email is 'spam' or 'ham'.
- Message: The textual content of the email.
No Missing Values: The dataset was pre-processed to ensure no missing entries, replacing any potential null values with empty strings to maintain data integrity.
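
As a quick check of the characteristics listed above, the file can be loaded and inspected with pandas. This is a minimal sketch: the four rows in `csv_text` are made-up stand-ins for mail_data.csv, so the example runs without the real file.

```python
import io

import pandas as pd

# Made-up stand-in for mail_data.csv; the real file has 5572 rows.
csv_text = """Category,Message
ham,See you at lunch
spam,WIN a free prize now
ham,
spam,Claim your cash reward
"""

# With the real file this would be: pd.read_csv("mail_data.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                                  # rows, columns
print(df["Category"].value_counts().to_dict())   # messages per label
print(df.isnull().sum().to_dict())               # null entries per column
```

An empty Message field is read as a null value, which is why the pre-processing step replaces nulls with empty strings.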

3. Methodology

The development of the spam classifier followed a standard machine learning pipeline:

3.1. Data Pre-processing

Handling Missing Values: Null values in the dataset were replaced with empty strings to prevent errors during vectorization.
Label Encoding: The categorical labels ('spam', 'ham') were converted into numerical representations (0 for spam, 1 for ham) to facilitate model training.
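
The two pre-processing steps can be sketched as follows. The DataFrame below is a made-up stand-in for the loaded dataset, and the variable names (`raw_mail_data`, `mail_data`, `X`, `Y`) are assumptions for illustration, not necessarily the names used in the notebook.

```python
import pandas as pd

# Made-up stand-in for the loaded dataset, including one null message.
raw_mail_data = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "spam"],
    "Message": ["See you at lunch", "WIN a free prize now", None, "Claim your cash reward"],
})

# Replace null values with empty strings so vectorization does not fail.
mail_data = raw_mail_data.fillna("")

# Encode the labels as described above: spam -> 0, ham -> 1.
mail_data["Category"] = mail_data["Category"].map({"spam": 0, "ham": 1})

X = mail_data["Message"]   # text input
Y = mail_data["Category"]  # numeric labels

print(Y.tolist())
```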

3.2. Data Splitting

The dataset was divided into training and testing sets to evaluate the model's generalization capabilities:

Training Set: 80% of the data, used to train the Logistic Regression model.
Test Set: 20% of the data, reserved for evaluating the trained model's performance on unseen data.
Stratified Sampling: The stratify parameter was used during the split so that the training and test sets each keep the same proportion of spam and ham emails as the full dataset. This does not balance the classes; it keeps each set representative of the overall class distribution.
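
A stratified 80/20 split can be sketched with `train_test_split`. The data here is synthetic (10 spam, 40 ham), and `random_state=42` is an assumed seed chosen only to make the example repeatable; the seed used in the notebook is not stated here.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic data: 10 spam (0) and 40 ham (1), i.e. 20% spam.
X = [f"message {i}" for i in range(50)]
Y = [0] * 10 + [1] * 40

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,     # 20% of the data goes to the test set
    stratify=Y,        # keep the spam/ham ratio in both sets
    random_state=42,   # assumed seed, for repeatability only
)

print(Counter(Y_train))  # 8 spam, 32 ham
print(Counter(Y_test))   # 2 spam, 8 ham
```

Both sets keep the 20% spam share of the full data.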

3.3. Text Vectorization

Email messages, being textual data, were transformed into numerical feature vectors using TfidfVectorizer. This technique converts text into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features, which reflects the importance of a word in a document relative to the entire corpus.

Parameters:
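- min_df=1: Keeps every term that appears in at least one document, so no term is dropped for being rare. This is the scikit-learn default.
- stop_words='english': Removes common English stop words (e.g., 'the', 'is', 'a') to reduce noise and computational load.
- lowercase=True: Converts all text to lowercase before tokenizing, so that 'Free' and 'free' map to the same feature and the vocabulary stays smaller. This is also the default.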
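
The effect of these parameters can be seen on a few made-up messages. The sketch fits the vectorizer on the training messages only and reuses the fitted vocabulary for the test message, which is the usual practice and is assumed to match the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages for illustration.
X_train = [
    "Win a FREE prize now",
    "The lunch is at noon",
    "Free cash prize, claim it today",
]
X_test = ["Claim the free lunch"]

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# Learn the vocabulary and IDF weights from the training messages only.
X_train_features = feature_extraction.fit_transform(X_train)

# Reuse the fitted vocabulary for unseen messages.
X_test_features = feature_extraction.transform(X_test)

print(sorted(feature_extraction.vocabulary_))
print(X_train_features.shape, X_test_features.shape)
```

The vocabulary contains 'free' but not 'FREE' or 'the', and both matrices have one column per vocabulary term.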

3.4. Model Training

A Logistic Regression model was chosen for its interpretability and effectiveness in binary classification tasks. The model was trained on the TF-IDF vectorized training data (X_train_features) and corresponding labels (Y_train).

4. Results

The trained model's performance was evaluated on both the training and test datasets using accuracy as the primary metric.

Accuracy on Training Data: 96.679%
Accuracy on Test Data: 97.130%

The model reaches about 97% accuracy on both sets. Test accuracy (97.130%) is slightly higher than training accuracy (96.679%), and this small gap suggests that the model generalizes well to unseen data with little overfitting.
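
Accuracy can be computed with `accuracy_score`. When one class is much rarer than the other, a per-class view may be a useful complement, because accuracy alone does not show how errors are split between spam and ham. The sketch below uses made-up labels and predictions, not the project's actual results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels and predictions: 0 = spam, 1 = ham.
Y_test     = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
prediction = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(Y_test, prediction)

# Rows are true labels, columns are predicted labels, in the order [spam, ham].
cm = confusion_matrix(Y_test, prediction, labels=[0, 1])

# Share of actual spam messages that were caught.
spam_recall = recall_score(Y_test, prediction, pos_label=0)

print(accuracy)      # 0.9
print(cm.tolist())   # [[1, 1], [0, 8]]
print(spam_recall)   # 0.5
```

In this toy case the accuracy is 90%, yet only half of the spam messages are detected.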

5. How to Use

To use this email spam classifier, follow these steps:

Clone the Repository:

```
git clone <repository_url>
cd <repository_name>
```

Install Dependencies: Ensure you have the necessary Python libraries installed. You can install them using pip:
```
pip install pandas scikit-learn joblib
```
Run the Notebook: Open and run the Jupyter Notebook (or Google Colab notebook) provided in the repository. The notebook includes all the steps for data loading, pre-processing, model training, and evaluation.

Make Predictions (Example):

```
import joblib

# Load the trained Logistic Regression model saved by the notebook.
model = joblib.load('trained_model.pkl')

# The model expects features produced by the same fitted TfidfVectorizer that was
# used during training. This snippet is meant to run in the notebook after the
# training cells, where `feature_extraction` is already fitted on X_train.
# Do not re-create the vectorizer here: an unfitted TfidfVectorizer cannot
# transform text, and one fitted on different data will not match the model.
# To predict outside the notebook, save the fitted vectorizer with joblib.dump()
# during training and load it here with joblib.load().

# New email message to classify
input_mail = ["Free entry in 2 a wkly comp to win FA Cup final tickets 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]

# Convert text to feature vectors with the fitted vectorizer
input_data_features = feature_extraction.transform(input_mail)

# Make prediction (1 = ham, 0 = spam)
prediction = model.predict(input_data_features)

if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')
```

6. Model Persistence

The trained Logistic Regression model is saved as trained_model.pkl using the joblib library. This allows for easy deployment and reuse of the model without needing to retrain it every time.
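
Predictions on new text also need the fitted TfidfVectorizer, so it can be convenient to persist it next to the model. The following is a sketch under assumptions: the training data is made up, the file name `feature_extraction.pkl` is hypothetical (the repository only ships trained_model.pkl), and a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data: 0 = spam, 1 = ham.
X_train = ["Win a free prize now", "Claim your free cash reward",
           "Lunch at noon tomorrow", "Meeting notes attached for review"]
Y_train = [0, 0, 1, 1]

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
model = LogisticRegression().fit(feature_extraction.fit_transform(X_train), Y_train)

new_mail = ["Claim your free prize"]
expected = model.predict(feature_extraction.transform(new_mail))

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "trained_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "feature_extraction.pkl")  # hypothetical name

    # Save both fitted objects.
    joblib.dump(model, model_path)
    joblib.dump(feature_extraction, vectorizer_path)

    # Later, or in another process: load both and predict without retraining.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

reloaded = loaded_model.predict(loaded_vectorizer.transform(new_mail))
print((reloaded == expected).all())  # the reloaded objects give the same prediction
```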

7. License

This project is open-source and available under the MIT License.
