MateiDev/Bachelor-Project

Can AI-Generated Phishing Emails Evade Detection?

Datasets and reproducibility instructions for the bachelor thesis “Can AI-Generated Phishing Emails Evade Detection? Testing Users and Security Solutions” (Computing Science, University of Groningen, July 2025).

Author: Matei Teodoru (S5222745)
Supervisors: dr. Fadi Mohsen & dr. George Azzopardi


1 Project Overview

Large Language Model (LLM) tools such as GPT-4o and Gemini 1.5 let attackers mass-produce phishing emails that are grammatically flawless. This repository benchmarks a traditional feature-based detector (10 external + 10 internal handcrafted features → LinearSVC) against emails generated by GPT-4o / o4-mini, to test where the classic approach still holds and where it breaks.

Pipeline (all in Main.ipynb):

  1. Load datasets: Enron & SpamAssassin (HAM) + Nazario (phish).
  2. Extract 10 external + 10 internal features per email.
  3. Split & scale data (train_test_split, StandardScaler).
  4. Train LinearSVC.
  5. Persist the model & feature list.
  6. Evaluate on the hold-out set and the LLM corpus.
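Steps 3 to 6 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in Main.ipynb: the feature matrix is synthetic random data standing in for the 20 extracted features, and the file name phishing_svc.joblib, the feature names feat_0 … feat_19, and the bundle layout are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real feature matrix (assumption):
# 20 numeric features per email, label 0 = ham, 1 = phish.
rng = np.random.default_rng(42)
n_per_class = 200
X_ham = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 20))
X_phish = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 20))
X = np.vstack([X_ham, X_phish])
y = np.array([0] * n_per_class + [1] * n_per_class)
feature_names = [f"feat_{i}" for i in range(20)]  # hypothetical names

# Step 3: split first, then fit the scaler on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)

# Step 4: train the linear SVM on the scaled training features.
clf = LinearSVC(random_state=42, max_iter=10000)
clf.fit(scaler.transform(X_train), y_train)

# Step 5: persist model, scaler, and feature list together (hypothetical layout).
model_path = os.path.join(tempfile.mkdtemp(), "phishing_svc.joblib")
joblib.dump({"model": clf, "scaler": scaler, "features": feature_names}, model_path)

# Step 6: reload and evaluate on the hold-out set.
bundle = joblib.load(model_path)
y_pred = bundle["model"].predict(bundle["scaler"].transform(X_test))
holdout_accuracy = accuracy_score(y_test, y_pred)
print(f"hold-out accuracy: {holdout_accuracy:.3f}")
```

Fitting the scaler on the training split only keeps hold-out statistics out of training. The same fitted scaler would then be reused to transform the LLM-generated corpus before prediction.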

2 Quick-Start (Python ≥ 3.12.5)

# Clone the repo
git clone https://github.com/MateiDev/Bachelor-Project.git
cd Bachelor-Project

# Install dependencies
pip install -r requirements.txt

3 Preparing the Datasets

| Corpus | What it is | Path expected by the notebook |
| --- | --- | --- |
| Nazario Phishing | 4 × phishing*.mbox (4 572 mails) | datasets_original/Phishing Nazario/phishing{0..3}.mbox |
| SpamAssassin HAM | Combined mbox (4 150 mails) | combined_datasets/combined_SpamAssasin.mbox (already in repo) |
| Enron HAM | Big! Full Enron Maildir → single mbox (≈ 6 GB) | combined_datasets/combined_Enron.mbox (user builds locally) |
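Before running the notebook, it can help to check that each mbox file is in place and holds roughly the expected number of messages. The sketch below uses only Python's standard mailbox module. The helper count_messages is hypothetical (it is not part of this repository), and the demo writes a tiny temporary mbox instead of touching the real datasets.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def count_messages(mbox_path: str) -> int:
    """Return the number of messages in an mbox file (hypothetical helper)."""
    if not os.path.exists(mbox_path):
        raise FileNotFoundError(mbox_path)
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


# Demo on a throwaway mbox with three messages.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
demo_box = mailbox.mbox(demo_path)
for i in range(3):
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = f"Test message {i}"
    msg.set_content("Hello")
    demo_box.add(msg)
demo_box.flush()
demo_box.close()

demo_count = count_messages(demo_path)
print(demo_count)
```

On the real data, the same call could be pointed at combined_datasets/combined_SpamAssasin.mbox and the result compared with the message count listed above.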

3.1 Build the Enron mbox

  1. Download the original Maildir archive (the link is in datasets_original/readme.md).
  2. Run the helper script and place the resulting file in combined_datasets/:
python datasets_original/maildir_to_mbox.py /path/to/enron-maildir
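For readers curious what such a conversion involves, here is a minimal sketch using the standard mailbox module. It is not the repository's maildir_to_mbox.py, whose contents are not shown here. It assumes the archive is a directory tree with one plain-text message per file, and the function name tree_to_mbox is hypothetical.

```python
import mailbox
import os
import tempfile


def tree_to_mbox(root_dir: str, mbox_path: str) -> int:
    """Append every file under root_dir to an mbox as one message each.

    Returns the number of messages written (hypothetical helper).
    """
    box = mailbox.mbox(mbox_path)
    written = 0
    try:
        for dirpath, dirnames, filenames in os.walk(root_dir):
            dirnames.sort()  # deterministic traversal order
            for name in sorted(filenames):
                with open(os.path.join(dirpath, name), "rb") as fh:
                    box.add(fh.read())  # raw bytes are accepted by add()
                written += 1
        box.flush()
    finally:
        box.close()
    return written


# Demo on a tiny fake tree: two users, three messages in total.
root = tempfile.mkdtemp()
for user, folder, subject in [
    ("user-a", "inbox", "first"),
    ("user-a", "sent", "second"),
    ("user-b", "inbox", "third"),
]:
    folder_path = os.path.join(root, user, folder)
    os.makedirs(folder_path, exist_ok=True)
    with open(os.path.join(folder_path, "1"), "w", encoding="utf-8") as fh:
        fh.write(f"From: {user}@example.com\nSubject: {subject}\n\n{subject} body\n")

out_path = os.path.join(tempfile.mkdtemp(), "combined_demo.mbox")
converted = tree_to_mbox(root, out_path)

# Re-open the result and read the subjects back as a sanity check.
check_box = mailbox.mbox(out_path, create=False)
subjects = sorted(m["Subject"] for m in check_box)
check_box.close()
print(converted, subjects)
```

A conversion of the full archive produces a file of several gigabytes, so appending message by message as above avoids holding the whole corpus in memory.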

4 Running the Notebook

Open Main.ipynb and run all cells.
Tested on Python 3.12.5 (macOS 15).
