Datasets and reproducibility instructions for the bachelor thesis “Can AI-Generated Phishing Emails Evade Detection? Testing Users and Security Solutions” (Computing Science, University of Groningen, July 2025).
Author: Matei Teodoru (S5222745)
Supervisors: dr. Fadi Mohsen & dr. George Azzopardi
Large Language Model (LLM) tools such as GPT-4o and Gemini 1.5 let attackers mass-produce phishing emails that are grammatically flawless.
This repository benchmarks a traditional feature-based detector (20 handcrafted external + internal features → LinearSVC) against emails generated by GPT-4o / o4-mini, testing where the classic approach still holds, and where it breaks.
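To make the idea of handcrafted features concrete, here is a minimal sketch that turns one raw email into a few numeric values using only the Python standard library. The four features, the keyword list, and the function names are hypothetical and chosen purely for illustration; the 20 features actually used in the thesis are defined in Main.ipynb and are not reproduced here.

```python
import re
from email import message_from_string
from email.message import Message

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
IP_URL_RE = re.compile(r"https?://\d{1,3}(\.\d{1,3}){3}")
# Hypothetical keyword list, for illustration only.
URGENT_WORDS = ("urgent", "verify", "suspended", "immediately")


def body_text(msg: Message) -> str:
    """Concatenate the decoded text/* parts of a message."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset label in the email
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def extract_features(raw_email: str) -> dict:
    """Return a small dict of illustrative, message-derived features."""
    msg = message_from_string(raw_email)
    text = body_text(msg)
    urls = URL_RE.findall(text)
    lowered = (str(msg.get("Subject", "")) + " " + text).lower()
    return {
        "num_urls": len(urls),
        "has_html": int(any(p.get_content_type() == "text/html" for p in msg.walk())),
        "num_ip_urls": sum(1 for u in urls if IP_URL_RE.match(u)),
        "num_urgent_words": sum(lowered.count(w) for w in URGENT_WORDS),
    }


sample = (
    "From: support@example.com\n"
    "Subject: Urgent: verify your account\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Your account is suspended. Log in at http://192.0.2.7/login immediately.\n"
)
features = extract_features(sample)
print(features)
```

Each email becomes one fixed-length row of numbers, which is the form a linear classifier such as LinearSVC expects.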
Pipeline (all in Main.ipynb):
- Load datasets: Enron & SpamAssassin (HAM) + Nazario (phish).
- Extract 10 external + 10 internal features per email.
- Split & scale data (train_test_split, StandardScaler).
- Train LinearSVC.
- Persist the model & feature list.
- Evaluate on the hold-out set and the LLM corpus.
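The split, scale, train, persist, and evaluate steps above can be sketched roughly as follows. This is not the notebook's code: the feature matrix is random synthetic data standing in for the real 20-column matrix, and the hyperparameters, file name, and the `evaluate` helper are assumptions made for the example. It requires scikit-learn, NumPy, and joblib.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real features: 20 columns, 0 = ham, 1 = phish.
rng = np.random.default_rng(0)
n_per_class, n_features = 200, 20
ham = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_features))
phish = rng.normal(loc=1.5, scale=1.0, size=(n_per_class, n_features))
X = np.vstack([ham, phish])
y = np.array([0] * n_per_class + [1] * n_per_class)
feature_names = [f"feature_{i}" for i in range(n_features)]

# Split first, then fit the scaler on the training part only,
# so no statistics from the hold-out set leak into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)
clf = LinearSVC(C=1.0, max_iter=10_000, random_state=0)
clf.fit(scaler.transform(X_train), y_train)


def evaluate(X_eval, y_eval) -> float:
    """Accuracy on any labelled set, reusing the already fitted scaler."""
    return accuracy_score(y_eval, clf.predict(scaler.transform(X_eval)))


holdout_accuracy = evaluate(X_test, y_test)

# Persist scaler, model, and feature order together so they stay in sync.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"  # hypothetical file name
    joblib.dump(
        {"scaler": scaler, "model": clf, "features": feature_names}, model_path
    )
    restored = joblib.load(model_path)

print(f"hold-out accuracy on synthetic data: {holdout_accuracy:.3f}")
```

The same `evaluate` helper could be called on a feature matrix built from a separate corpus, such as LLM-generated emails, as long as its columns follow the stored feature order.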
# Clone the repo
git clone https://github.com/MateiDev/Bachelor-Project.git
cd Bachelor-Project
# Install dependencies
pip install -r requirements.txt

| Corpus | What it is | Path expected by the notebook |
|---|---|---|
| Nazario Phishing | 4 × phishing*.mbox (4 572 mails) | datasets_original/Phishing Nazario/phishing{0..3}.mbox |
| SpamAssassin HAM | Combined mbox (4 150 mails) | combined_datasets/combined_SpamAssasin.mbox (already in repo) |
| Enron HAM | Big! Full Enron Maildir → single mbox (≈ 6 GB) | combined_datasets/combined_Enron.mbox (user builds locally) |
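Before running the notebook it can help to check that each mbox file is where the table says it should be. The sketch below uses the standard-library `mailbox` module to count messages per file. The helper names `count_messages` and `report` are hypothetical, and the Nazario paths assume that `phishing{0..3}` expands to `phishing0.mbox` through `phishing3.mbox`.

```python
import mailbox
from pathlib import Path

EXPECTED_PATHS = [
    "datasets_original/Phishing Nazario/phishing0.mbox",
    "datasets_original/Phishing Nazario/phishing1.mbox",
    "datasets_original/Phishing Nazario/phishing2.mbox",
    "datasets_original/Phishing Nazario/phishing3.mbox",
    "combined_datasets/combined_SpamAssasin.mbox",
    "combined_datasets/combined_Enron.mbox",
]


def count_messages(mbox_path) -> int:
    """Return the number of messages in an existing mbox file."""
    path = Path(mbox_path)
    if not path.is_file():
        raise FileNotFoundError(path)
    box = mailbox.mbox(str(path), create=False)
    try:
        return len(box)  # scans the whole file; slow for multi-GB archives
    finally:
        box.close()


def report(paths) -> dict:
    """Map each path to its message count, or None if the file is missing."""
    result = {}
    for p in paths:
        try:
            result[p] = count_messages(p)
        except FileNotFoundError:
            result[p] = None
    return result


for path, count in report(EXPECTED_PATHS).items():
    print(f"{path}: {'missing' if count is None else count}")
```

Run it from the repository root; a `missing` entry for the Enron file simply means the local build step has not been done yet.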
- Download the original Maildir archive, link in datasets_original/readme.md.
- Run the helper script and place the resulting file in combined_datasets/:

python datasets_original/maildir_to_mbox.py /path/to/enron-maildir

Open Main.ipynb and run all cells.
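For readers curious what the Maildir-to-mbox conversion above involves, here is a minimal standard-library sketch. It is not the repository's maildir_to_mbox.py, which may work differently. It assumes the Enron archive is a plain directory tree with one raw email per file, and the function name `tree_to_mbox` is hypothetical.

```python
import mailbox
from email import message_from_binary_file
from pathlib import Path


def tree_to_mbox(root, out_path) -> int:
    """Append every regular file under `root` to an mbox; return the count."""
    root = Path(root)
    out = Path(out_path)
    # Collect the input files before the output file is created, so the
    # output is never picked up as an input even if it lives under `root`.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    box = mailbox.mbox(str(out), create=True)
    written = 0
    try:
        for file in files:
            with file.open("rb") as fh:
                msg = message_from_binary_file(fh)
            box.add(msg)  # mailbox adds the mbox "From " separator line
            written += 1
        box.flush()
    finally:
        box.close()
    return written
```

A call such as `tree_to_mbox("/path/to/enron-maildir", "combined_datasets/combined_Enron.mbox")` would then produce the single file the notebook expects, assuming the archive layout matches the assumption above.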
Tested on Python 3.12.5 (macOS 15)