A small static-analysis malware classifier (gradient-boosted trees over a 24-dim PE feature vector), with an adversarial-evasion harness that measures how four feature-manipulation attacks change the bypass rate.
This repo ships with a synthetic PE-feature generator so the project is
clonable and runnable end-to-end without supplying real malware samples. The
generator's distributions are loosely matched to typical malware vs. benign
PE statistics. For real experiments, swap the generator for a real
pefile-based extractor on EMBER / MalwareBazaar (see data/README.md).
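For a feel of what the generator does, here is a minimal sketch, assuming illustrative distribution parameters and only three of the 24 columns (the shipped parameters live in features/synthetic_data.py and will differ):

```python
import numpy as np

def generate_samples(n: int, malicious: bool, seed: int = 0) -> np.ndarray:
    """Return an (n, 3) slice of illustrative features; the shipped
    generator in features/synthetic_data.py emits all 24 columns."""
    rng = np.random.default_rng(seed)
    # Assumption: packed malware skews toward high section entropy, more
    # suspicious imports, and more embedded URLs than benign software.
    entropy = rng.normal(7.2 if malicious else 5.5, 0.6, n).clip(0.0, 8.0)
    suspicious_imports = rng.poisson(6.0 if malicious else 1.0, n)
    url_count = rng.poisson(4.0 if malicious else 0.5, n)
    return np.column_stack([entropy, suspicious_imports, url_count])
```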
The 24 features are grouped as follows (an illustrative excerpt of the corresponding name list appears after the table):
| Group | Features |
|---|---|
| Header metadata | machine type, subsystem, DLL chars, timestamp, header size |
| Section statistics | section count, entropy stats, size stats |
| Imports | total / unique DLLs, suspicious count, import-hash entropy |
| Strings | URL count, IP count, registry keys, paths, entropy |
| Packer indicators | packed flag, high-entropy + low-imports, name anomalies, overlay size |
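In code, the groups flatten to an ordered list of column names. The excerpt below is illustrative only; the specific names are assumptions, and the canonical list is features/feature_names.py:

```python
# Illustrative excerpt; see features/feature_names.py for the full 24 names.
FEATURE_NAMES = [
    # header metadata
    "machine_type", "subsystem", "dll_characteristics", "timestamp_is_zero",
    # section statistics
    "section_count", "mean_section_entropy", "max_section_entropy",
    # imports
    "total_imports", "unique_dlls", "suspicious_import_count",
    # strings
    "url_count", "ip_count",
    # packer indicators
    "is_packed", "section_name_anomaly", "overlay_size",
]
```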
The eval harness applies four feature-space attacks to malware samples (a sketch of one transform follows the table):
| Attack | What it does |
|---|---|
| Entropy padding | Appends low-entropy benign data to lower the section-entropy statistics |
| Import obfuscation | Zeroes the suspicious-import features |
| Section renaming | Clears the section-name anomaly flag |
| Timestamp manipulation | Clears the zero-timestamp flag |
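Each attack is a pure transform on the feature matrix. Here is a minimal sketch of import obfuscation, assuming hypothetical column indices (the shipped transforms live in evaluation/adversarial_eval.py):

```python
import numpy as np

# Hypothetical indices; the real mapping comes from features/feature_names.py.
SUSPICIOUS_IMPORT_COUNT = 9
IMPORT_HASH_ENTROPY = 11

def import_obfuscation(X: np.ndarray) -> np.ndarray:
    """Model an attacker who resolves APIs at runtime (e.g. via
    GetProcAddress), leaving nothing for the static import scan to flag."""
    X = X.copy()
    X[:, SUSPICIOUS_IMPORT_COUNT] = 0.0
    X[:, IMPORT_HASH_ENTROPY] = 0.0
    return X
```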
A typical run shows that import obfuscation alone collapses detection; surfacing that single-feature-group fragility is the finding the project is designed to demonstrate.
Project layout:

```text
05_malware_classifier/
├── features/
│   ├── feature_names.py
│   └── synthetic_data.py
├── models/
│   ├── train.py
│   └── classifier.pkl        # generated
├── evaluation/
│   ├── evaluate.py
│   └── adversarial_eval.py
├── data/
│   └── README.md             # how to use real data
├── results/
├── requirements.txt
└── README.md
```
Quickstart:

```bash
cd 05_malware_classifier
python -m venv .venv
source .venv/bin/activate        # Linux / macOS
# .venv\Scripts\Activate.ps1     # Windows PowerShell
pip install -r requirements.txt

# 1. Train the model on the synthetic dataset
python models/train.py

# 2. Run held-out evaluation
python evaluation/evaluate.py

# 3. Run the adversarial-evasion harness
python evaluation/adversarial_eval.py
```

Sample adversarial output:
```text
Attack                    Detection   Bypass rate
-------------------------------------------------
entropy_padding              98.75%         1.25%
import_obfuscation            0.75%        99.25%
section_renaming            100.00%         0.00%
timestamp_manipulation      100.00%         0.00%
```
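The bypass rate is simply one minus the detection rate on the attacked samples. A sketch of the metric, assuming a scikit-learn-style model with label 1 = malicious:

```python
import numpy as np

def bypass_rate(model, X_malware: np.ndarray, attack) -> float:
    """Bypass rate = 1 - detection rate on attacked malware
    (labels: 1 = malicious, 0 = benign)."""
    detected = (model.predict(attack(X_malware)) == 1).mean()
    return float(1.0 - detected)
```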
Next steps:

- Replace the synthetic generator with a real pefile-based feature extractor (see the sketch after this list)
- Train on EMBER and re-run the adversarial harness
- Add a defence layer: ensemble with an LLM-based behavioural review
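As a starting point for the first item, a minimal pefile-based sketch covering a few of the 24 features (the function and key names are illustrative assumptions; mirror features/feature_names.py for the full set):

```python
import numpy as np
import pefile  # pip install pefile

def extract_features(path: str) -> dict:
    """Extract a handful of the 24 features from a real PE file.
    Illustrative only; not the full extractor."""
    pe = pefile.PE(path)
    entropies = [s.get_entropy() for s in pe.sections]
    imports = getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])
    return {
        "timestamp_is_zero": float(pe.FILE_HEADER.TimeDateStamp == 0),
        "section_count": len(pe.sections),
        "mean_section_entropy": float(np.mean(entropies)) if entropies else 0.0,
        "unique_dlls": len({e.dll for e in imports}),
        "total_imports": sum(len(e.imports) for e in imports),
    }
```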