A small static-analysis malware classifier (gradient-boosted trees over a 24-dim PE feature vector), with an adversarial-evasion harness that measures how four feature-manipulation attacks change the bypass rate.
This repo ships with a synthetic PE-feature generator so the project is
clonable and runnable end-to-end without supplying real malware samples. The
generator's distributions are loosely matched to typical malware vs. benign
PE statistics. For real experiments, swap the generator for a real
pefile-based extractor on EMBER / MalwareBazaar (see data/README.md).
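For a feel of what the generator does, here is a minimal sketch, assuming illustrative distribution parameters and only three of the 24 columns (the shipped parameters live in features/synthetic_data.py and will differ):

```python
import numpy as np

def generate_samples(n: int, malicious: bool, seed: int = 0) -> np.ndarray:
    """Return an (n, 3) slice of illustrative features; the shipped
    generator in features/synthetic_data.py emits all 24 columns."""
    rng = np.random.default_rng(seed)
    # Assumption: packed malware skews toward high section entropy, more
    # suspicious imports, and more embedded URLs than benign software.
    entropy = rng.normal(7.2 if malicious else 5.5, 0.6, n).clip(0.0, 8.0)
    suspicious_imports = rng.poisson(6.0 if malicious else 1.0, n)
    url_count = rng.poisson(4.0 if malicious else 0.5, n)
    return np.column_stack([entropy, suspicious_imports, url_count])
```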
The 24 features are grouped as follows (an illustrative excerpt of the corresponding name list appears after the table):
| Group | Features |
|---|---|
| Header metadata | machine type, subsystem, DLL chars, timestamp, header size |
| Section statistics | section count, entropy stats, size stats |
| Imports | total / unique DLLs, suspicious count, import-hash entropy |
| Strings | URL count, IP count, registry keys, paths, entropy |
| Packer indicators | packed flag, high-entropy + low-imports, name anomalies, overlay size |
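In code, the groups flatten to an ordered list of column names. The excerpt below is illustrative only; the specific names are assumptions, and the canonical list is features/feature_names.py:

```python
# Illustrative excerpt; see features/feature_names.py for the full 24 names.
FEATURE_NAMES = [
    # header metadata
    "machine_type", "subsystem", "dll_characteristics", "timestamp_is_zero",
    # section statistics
    "section_count", "mean_section_entropy", "max_section_entropy",
    # imports
    "total_imports", "unique_dlls", "suspicious_import_count",
    # strings
    "url_count", "ip_count",
    # packer indicators
    "is_packed", "section_name_anomaly", "overlay_size",
]
```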
The eval harness applies four feature-space attacks to malware samples (a sketch of one transform follows the table):
| Attack | What it does |
|---|---|
| Entropy padding | Appends low-entropy benign data to lower the section-entropy statistics |
| Import obfuscation | Zeroes the suspicious-import features |
| Section renaming | Clears the section-name anomaly flag |
| Timestamp manipulation | Clears the zero-timestamp flag |
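Each attack is a pure transform on the feature matrix. Here is a minimal sketch of import obfuscation, assuming hypothetical column indices (the shipped transforms live in evaluation/adversarial_eval.py):

```python
import numpy as np

# Hypothetical indices; the real mapping comes from features/feature_names.py.
SUSPICIOUS_IMPORT_COUNT = 9
IMPORT_HASH_ENTROPY = 11

def import_obfuscation(X: np.ndarray) -> np.ndarray:
    """Model an attacker who resolves APIs at runtime (e.g. via
    GetProcAddress), leaving nothing for the static import scan to flag."""
    X = X.copy()
    X[:, SUSPICIOUS_IMPORT_COUNT] = 0.0
    X[:, IMPORT_HASH_ENTROPY] = 0.0
    return X
```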
A typical run shows that import obfuscation alone collapses detection; surfacing that single-feature-group fragility is the finding the project is designed to demonstrate.
Project layout:

```text
05_malware_classifier/
├── features/
│   ├── feature_names.py
│   └── synthetic_data.py
├── models/
│   ├── train.py
│   └── classifier.pkl        # generated
├── evaluation/
│   ├── evaluate.py
│   └── adversarial_eval.py
├── data/
│   └── README.md             # how to use real data
├── results/
├── requirements.txt
└── README.md
```
Quickstart:

```bash
cd 05_malware_classifier
python -m venv .venv
source .venv/bin/activate        # Linux / macOS
# .venv\Scripts\Activate.ps1     # Windows PowerShell
pip install -r requirements.txt

# 1. Train the model on the synthetic dataset
python models/train.py

# 2. Run held-out evaluation
python evaluation/evaluate.py

# 3. Run the adversarial-evasion harness
python evaluation/adversarial_eval.py
```

Sample adversarial output:
```text
Attack                    Detection   Bypass rate
-------------------------------------------------
entropy_padding              98.75%         1.25%
import_obfuscation            0.75%        99.25%
section_renaming            100.00%         0.00%
timestamp_manipulation      100.00%         0.00%
```
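The bypass rate is simply one minus the detection rate on the attacked samples. A sketch of the metric, assuming a scikit-learn-style model with label 1 = malicious:

```python
import numpy as np

def bypass_rate(model, X_malware: np.ndarray, attack) -> float:
    """Bypass rate = 1 - detection rate on attacked malware
    (labels: 1 = malicious, 0 = benign)."""
    detected = (model.predict(attack(X_malware)) == 1).mean()
    return float(1.0 - detected)
```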
Next steps:

- Replace the synthetic generator with a real pefile-based feature extractor (see the sketch after this list)
- Train on EMBER and re-run the adversarial harness
- Add a defence layer: ensemble with an LLM-based behavioural review
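As a starting point for the first item, a minimal pefile-based sketch covering a few of the 24 features (the function and key names are illustrative assumptions; mirror features/feature_names.py for the full set):

```python
import numpy as np
import pefile  # pip install pefile

def extract_features(path: str) -> dict:
    """Extract a handful of the 24 features from a real PE file.
    Illustrative only; not the full extractor."""
    pe = pefile.PE(path)
    entropies = [s.get_entropy() for s in pe.sections]
    imports = getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])
    return {
        "timestamp_is_zero": float(pe.FILE_HEADER.TimeDateStamp == 0),
        "section_count": len(pe.sections),
        "mean_section_entropy": float(np.mean(entropies)) if entropies else 0.0,
        "unique_dlls": len({e.dll for e in imports}),
        "total_imports": sum(len(e.imports) for e in imports),
    }
```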