Can AI-Generated Phishing Emails Evade Detection?

Datasets and reproducibility instructions for the bachelor thesis “Can AI-Generated Phishing Emails Evade Detection? Testing Users and Security Solutions” (Computing Science, University of Groningen, July 2025).

Author: Matei Teodoru (S5222745)
Supervisors: dr. Fadi Mohsen & dr. George Azzopardi

1 Project Overview

Large Language Model (LLM) tools such as GPT-4o and Gemini 1.5 let attackers mass-produce phishing emails that are grammatically flawless. This repository benchmarks a traditional feature-based detector (20 handcrafted features, 10 external and 10 internal, fed into a LinearSVC) against emails generated by GPT-4o / o4-mini, testing where the classic approach still holds and where it breaks.
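To make "handcrafted features" concrete, the sketch below turns one raw email into a small numeric feature dictionary using only the standard library. The README does not list the thesis's 20 features, so the four features, the `URGENT_WORDS` list, and the function names here are hypothetical illustrations, not the notebook's actual feature set.

```python
import re
from email import message_from_string
from email.message import Message
from email.utils import parseaddr

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
# Hypothetical keyword list, chosen only for illustration.
URGENT_WORDS = ("urgent", "verify", "suspended", "password", "immediately")


def body_text(msg: Message) -> str:
    """Concatenate all text/* parts of a message into one string."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset label in the header
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def extract_features(msg: Message) -> dict[str, float]:
    """Compute a few illustrative features from headers and body."""
    text = body_text(msg)
    lowered = text.lower()
    from_domain = parseaddr(msg.get("From", ""))[1].rpartition("@")[2].lower()
    reply_domain = parseaddr(msg.get("Reply-To", ""))[1].rpartition("@")[2].lower()
    return {
        "num_urls": float(len(URL_RE.findall(text))),
        "num_urgent_words": float(sum(lowered.count(w) for w in URGENT_WORDS)),
        "has_html_part": float(any(p.get_content_type() == "text/html" for p in msg.walk())),
        "reply_to_mismatch": float(bool(reply_domain) and reply_domain != from_domain),
    }


sample = (
    "From: Support <support@bank.example>\n"
    "Reply-To: help@evil.example\n"
    "Subject: Account notice\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "URGENT: verify your password at http://evil.example/login immediately.\n"
)
features = extract_features(message_from_string(sample))
print(features)
```

A well-written LLM-generated email can score low on wording-based features like `num_urgent_words`, which is one plausible reason a feature-based detector might miss it.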

Pipeline (all in Main.ipynb):

1. Load datasets: Enron & SpamAssassin (HAM) + Nazario (phish).
2. Extract 10 external + 10 internal features per email.
3. Split & scale data (train_test_split, StandardScaler).
4. Train LinearSVC.
5. Persist the model & feature list.
6. Evaluate on the hold-out set and the LLM corpus.
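The split, scale, train, persist, and evaluate steps above can be sketched with scikit-learn as follows. This is a minimal sketch on synthetic data, not the code in Main.ipynb: the 80/20 split, the `random_state`, the use of `joblib`, the bundle layout, and the placeholder feature names are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real feature matrix: 20 features per email.
rng = np.random.default_rng(42)
feature_names = [f"feature_{i}" for i in range(20)]  # placeholder names
X_ham = rng.normal(loc=0.0, scale=1.0, size=(200, 20))
X_phish = rng.normal(loc=1.5, scale=1.0, size=(200, 20))
X = np.vstack([X_ham, X_phish])
y = np.array([0] * 200 + [1] * 200)  # 0 = ham, 1 = phishing

# Hold out 20% of the data (assumed ratio), keeping class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
clf = LinearSVC(random_state=42)
clf.fit(scaler.transform(X_train), y_train)

# Persist scaler, model, and feature order together so they stay in sync.
model_path = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump({"scaler": scaler, "model": clf, "features": feature_names}, model_path)

# Reload and evaluate on the hold-out set.
bundle = joblib.load(model_path)
y_pred = bundle["model"].predict(bundle["scaler"].transform(X_test))
print(classification_report(y_test, y_pred, target_names=["ham", "phishing"]))
```

Evaluating on the LLM corpus would follow the same last two lines, with the feature matrix of the generated emails in place of `X_test` and the columns in the saved feature order.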

2 Quick-Start (Python ≥ 3.12.5)

# Clone the repo
git clone https://github.com/MateiDev/Bachelor-Project.git
cd Bachelor-Project

# Install dependencies
pip install -r requirements.txt

3 Preparing the Datasets

Corpus	What it is	Path expected by the notebook
Nazario Phishing	4 × `phishing*.mbox` (4 572 mails)	`datasets_original/Phishing Nazario/phishing{0..3}.mbox`
SpamAssassin HAM	Combined mbox (4 150 mails)	`combined_datasets/combined_SpamAssasin.mbox` (already in repo)
Enron HAM	Full Enron Maildir converted to a single mbox (large: ≈ 6 GB)	`combined_datasets/combined_Enron.mbox` (built locally by the user)
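As an illustration of how mbox files like those in the table can be read, here is a small sketch using Python's standard `mailbox` module. The `load_mbox` helper and the 0/1 label convention are hypothetical, and the demo builds a tiny throwaway mbox so that it runs without the real corpora.

```python
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path


def load_mbox(path: Path, label: int) -> list[tuple[mailbox.mboxMessage, int]]:
    """Return every message in an mbox file paired with a class label."""
    if not path.exists():
        raise FileNotFoundError(f"Expected dataset at {path}")
    box = mailbox.mbox(str(path), create=False)
    try:
        return [(msg, label) for msg in box]
    finally:
        box.close()


# Build a two-message demo mbox in a temporary directory.
demo_path = Path(tempfile.mkdtemp()) / "demo.mbox"
demo_box = mailbox.mbox(str(demo_path))
for subject in ("Meeting notes", "Lunch on Friday?"):
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = subject
    msg.set_content("Hello Bob")
    demo_box.add(msg)
demo_box.close()  # close() also flushes pending writes

emails = load_mbox(demo_path, label=0)  # assumed convention: 0 = ham, 1 = phishing
print(len(emails), [m["Subject"] for m, _ in emails])
```

With the repository checked out, the same helper could be pointed at a path from the table, for example `Path("combined_datasets/combined_SpamAssasin.mbox")`. The early `FileNotFoundError` is useful for the Enron file, which does not exist until it has been built locally.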

3.1 Build the Enron mbox

Download the original Maildir archive (link in datasets_original/readme.md).
Run the helper script and place the resulting file in combined_datasets/:

python datasets_original/maildir_to_mbox.py /path/to/enron-maildir
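For readers curious what such a conversion involves, the sketch below walks a directory tree and appends each file to a single mbox using the standard library. It is not the repository's `maildir_to_mbox.py`; the `tree_to_mbox` name and the assumption that the tree holds one RFC 822 message per file are mine.

```python
import mailbox
import tempfile
from email import message_from_bytes
from pathlib import Path


def tree_to_mbox(root: Path, out_path: Path) -> int:
    """Append every file under root to an mbox at out_path; return the count.

    out_path should live outside root, otherwise the mbox itself would be
    picked up as an input file.
    """
    box = mailbox.mbox(str(out_path))
    count = 0
    try:
        for file_path in sorted(p for p in root.rglob("*") if p.is_file()):
            box.add(message_from_bytes(file_path.read_bytes()))
            count += 1
    finally:
        box.close()  # close() also flushes pending writes
    return count


# Demo on a tiny throwaway tree instead of the multi-gigabyte Enron archive.
work = Path(tempfile.mkdtemp())
inbox = work / "maildir" / "user-a" / "inbox"
inbox.mkdir(parents=True)
(inbox / "1").write_text("From: a@example.com\nSubject: first\n\nbody one\n")
(inbox / "2").write_text("From: b@example.com\nSubject: second\n\nbody two\n")

out_file = work / "combined_demo.mbox"
written = tree_to_mbox(work / "maildir", out_file)
print(f"wrote {written} messages to {out_file}")
```

For the real corpus, use the helper script as shown above; this sketch reads each file fully into memory and does no error handling for malformed messages.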

4 Running the Notebook

Open Main.ipynb and run all cells.
Tested on Python 3.12.5 (macOS 15).
