This project presents a robust email spam classification system developed using machine learning techniques. The primary objective is to accurately distinguish between legitimate (ham) and unsolicited (spam) emails, thereby enhancing user experience and mitigating security risks associated with phishing and malicious content. The classifier employs a Logistic Regression model trained on a vectorized dataset of email messages.
The dataset utilized for this project is mail_data.csv, which comprises a collection of email messages labeled as either 'spam' or 'ham'.
Dataset Characteristics:
- Total Entries: 5572
- Features:
  - Category: The label indicating whether an email is 'spam' or 'ham'.
  - Message: The textual content of the email.
- No Missing Values: The dataset was pre-processed to ensure no missing entries, replacing any potential null values with empty strings to maintain data integrity.
The development of the spam classifier followed a standard machine learning pipeline:
- Handling Missing Values: Null values in the dataset were replaced with empty strings to prevent errors during vectorization.
- Label Encoding: The categorical labels ('spam', 'ham') were converted into numerical representations (0 for spam, 1 for ham) to facilitate model training.
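The two pre-processing steps above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the three rows are made-up stand-ins for mail_data.csv, and the names raw_mail_data, mail_data, X, and Y are assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for mail_data.csv with the same two columns (illustrative rows only).
raw_mail_data = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["See you at lunch?", "WIN a free prize now!!!", None],
})

# Replace nulls with empty strings so the vectorizer never receives a missing value.
mail_data = raw_mail_data.fillna("")

# Encode the labels with the mapping described above: spam -> 0, ham -> 1.
mail_data["Category"] = mail_data["Category"].map({"spam": 0, "ham": 1})

X = mail_data["Message"]                # text used as model input
Y = mail_data["Category"].astype(int)   # numeric target

print(mail_data)
```

With the real dataset, the DataFrame would presumably come from pd.read_csv('mail_data.csv') instead of the inline rows.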
The dataset was divided into training and testing sets to evaluate the model's generalization capabilities:
- Training Set: 80% of the data, used to train the Logistic Regression model.
- Test Set: 20% of the data, reserved for evaluating the trained model's performance on unseen data.
- Stratified Sampling: The stratify parameter was used during the split so that the proportion of spam and ham emails is preserved in both the training and test sets. This keeps the class distribution consistent between the two sets.
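A small sketch of an 80/20 stratified split is shown below. The data is synthetic (10 spam and 40 ham placeholders), and random_state=3 is an assumed value used only to make the example repeatable; the source does not state which seed was used.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic data: 10 spam (0) and 40 ham (1), so 20% of the messages are spam.
X = [f"message {i}" for i in range(50)]
Y = [0] * 10 + [1] * 40

# test_size=0.2 gives the 80/20 split; stratify=Y preserves the spam/ham ratio.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3, stratify=Y
)

print(Counter(Y_train))  # class counts in the training set
print(Counter(Y_test))   # class counts in the test set
```

Both sets keep the original 20% spam share: 8 of 40 training messages and 2 of 10 test messages.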
Email messages, being textual data, were transformed into numerical feature vectors using TfidfVectorizer. This technique converts text into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features, which reflects the importance of a word in a document relative to the entire corpus.
- Parameters:
  - min_df=1: Considers words that appear in at least one document.
  - stop_words='english': Removes common English stop words (e.g., 'the', 'is', 'a') to reduce noise and computational load.
  - lowercase=True: Converts all text to lowercase, standardizing words and reducing dimensionality.
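The snippet below sketches how a vectorizer with these parameters might be applied. The messages are invented for illustration, and the variable names train_messages and test_messages are assumptions. The vectorizer is fitted on the training messages only and then reused unchanged for unseen text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example messages (not taken from mail_data.csv).
train_messages = [
    "Win a FREE prize now",
    "Are we still meeting for lunch",
    "Free entry to win tickets",
]
test_messages = ["Lunch is free today"]

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# Learn the vocabulary and IDF weights from the training messages only.
X_train_features = feature_extraction.fit_transform(train_messages)

# Reuse the fitted vocabulary for unseen messages; do not re-fit on test data.
X_test_features = feature_extraction.transform(test_messages)

print(sorted(feature_extraction.vocabulary_))            # learned terms
print(X_train_features.shape, X_test_features.shape)     # same number of columns
```

Because of lowercase=True, "FREE" and "Free" map to the single term "free", and stop words such as "are" do not appear in the vocabulary.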
A Logistic Regression model was chosen for its interpretability and effectiveness in binary classification tasks. The model was trained on the TF-IDF vectorized training data (X_train_features) and corresponding labels (Y_train).
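As a rough sketch of this training step, the example below fits a Logistic Regression model on a handful of invented messages using scikit-learn's default settings; the hyperparameters actually used in the notebook are not stated here, so the defaults are an assumption. It also illustrates the interpretability point: with labels encoded as 0 for spam and 1 for ham, coef_ holds the weights for class 1, so a negative weight pushes a message toward spam.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented training messages; 0 = spam, 1 = ham.
X_train = [
    "Win a free prize now, claim your cash",
    "Free entry, text WIN to claim tickets",
    "Urgent! Claim your free cash prize",
    "Are we still meeting for lunch today",
    "Can you send me the report tomorrow",
    "Thanks, see you at the meeting",
]
Y_train = [0, 0, 0, 1, 1, 1]

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

# Default hyperparameters are an assumption made for this sketch.
model = LogisticRegression()
model.fit(X_train_features, Y_train)

# Look up the learned weight of individual terms.
vocabulary = feature_extraction.vocabulary_
free_weight = model.coef_[0][vocabulary["free"]]
meeting_weight = model.coef_[0][vocabulary["meeting"]]

print("free:", free_weight)        # negative: evidence for spam (class 0)
print("meeting:", meeting_weight)  # positive: evidence for ham (class 1)
```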
The trained model's performance was evaluated on both the training and test datasets using accuracy as the primary metric.
- Accuracy on Training Data: 96.679%
- Accuracy on Test Data: 97.130%
These results indicate that the model classifies emails as spam or ham with high accuracy. Test accuracy is within about half a percentage point of training accuracy, which suggests good generalization and minimal overfitting.
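Accuracy is the fraction of messages whose predicted label matches the true label. The sketch below computes it with scikit-learn's accuracy_score on ten made-up labels; the numbers are purely illustrative and unrelated to the figures reported above, and the variable names are assumptions.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions (0 = spam, 1 = ham).
Y_test = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
prediction_on_test_data = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]

# One spam message is misclassified as ham, so 9 of 10 predictions are correct.
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

print(f"Accuracy on test data: {accuracy_on_test_data:.3%}")
```

The same call on the training labels and training predictions would give the training accuracy.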
To use this email spam classifier, follow these steps:
- Clone the Repository:

      git clone <repository_url>
      cd <repository_name>
- Install Dependencies: Ensure you have the necessary Python libraries installed. You can install them using pip:

      pip install pandas scikit-learn joblib
- Run the Notebook: Open and run the Jupyter Notebook (or Google Colab notebook) provided in the repository. The notebook includes all the steps for data loading, pre-processing, model training, and evaluation.
- Make Predictions (Example):

      import joblib

      # Load the trained model saved by the notebook.
      model = joblib.load('trained_model.pkl')

      # The vectorizer must be the same fitted TfidfVectorizer used during training,
      # because its vocabulary and IDF weights define the model's input features.
      # This example assumes it runs in the notebook kernel, where
      # `feature_extraction` is already defined and fitted.
      # Do not re-create the vectorizer here: a new, unfitted TfidfVectorizer
      # cannot transform text. For deployment, save the fitted vectorizer with
      # joblib.dump() during training and load it with joblib.load().

      # New email message to classify
      input_mail = ["Free entry in 2 a wkly comp to win FA Cup final tickets 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]

      # Convert the text to TF-IDF feature vectors
      input_data_features = feature_extraction.transform(input_mail)

      # Make the prediction (1 = ham, 0 = spam)
      prediction = model.predict(input_data_features)

      if prediction[0] == 1:
          print('Ham mail')
      else:
          print('Spam mail')
The trained Logistic Regression model is saved as trained_model.pkl using the joblib library. This allows for easy deployment and reuse of the model without needing to retrain it every time.
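The following sketch shows one way the model could be persisted with joblib and loaded again. It also saves the fitted vectorizer, since predictions on new text need the same vocabulary; that second file and its name, feature_extraction.pkl, are hypothetical and not something the project is stated to produce. The training data is a two-message toy set, and a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (0 = spam, 1 = ham), only to obtain fitted objects to save.
messages = ["Win a free prize now", "Are we still meeting for lunch"]
labels = [0, 1]

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
model = LogisticRegression().fit(feature_extraction.fit_transform(messages), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "trained_model.pkl")
    # Hypothetical file name for the fitted vectorizer.
    vectorizer_path = os.path.join(tmp_dir, "feature_extraction.pkl")

    # Save both fitted objects.
    joblib.dump(model, model_path)
    joblib.dump(feature_extraction, vectorizer_path)

    # Load them back, as a separate prediction script would.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# Classify a new message with the reloaded objects.
new_mail = ["Claim your free prize"]
prediction = loaded_model.predict(loaded_vectorizer.transform(new_mail))
print("Ham mail" if prediction[0] == 1 else "Spam mail")
```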
This project is open-source and available under the MIT License.