
# Malware Static Analysis Classifier

A small static-analysis malware classifier (gradient-boosted trees over a 24-dim PE feature vector), with an adversarial-evasion harness that measures how four feature-manipulation attacks change the bypass rate.

## How it works

This repo ships with a synthetic PE-feature generator so the project is clonable and runnable end-to-end without supplying real malware samples. The generator's distributions are loosely matched to typical malware vs. benign PE statistics. For real experiments, swap the generator for a real pefile-based extractor on EMBER / MalwareBazaar (see data/README.md).
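As a rough illustration of the synthetic-generation idea, the sketch below draws per-class feature vectors whose means differ between malware and benign samples. The function name, the simple Gaussian shift, and the parameters are illustrative assumptions; the actual generator lives in `features/synthetic_data.py`.

```python
import numpy as np

def synth_samples(n: int, malware: bool, dim: int = 24, seed: int = 0) -> np.ndarray:
    """Draw n synthetic 24-dim feature vectors for one class.

    A toy stand-in for per-feature distributions: malware skews toward
    higher values (more entropy, more suspicious indicators), benign
    toward lower ones.
    """
    rng = np.random.default_rng(seed)
    shift = 1.0 if malware else -1.0
    return rng.normal(loc=shift, scale=1.0, size=(n, dim))
```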

The 24 features are grouped into:

| Group | Features |
| --- | --- |
| Header metadata | machine type, subsystem, DLL characteristics, timestamp, header size |
| Section statistics | section count, entropy stats, size stats |
| Imports | total / unique DLLs, suspicious count, import-hash entropy |
| Strings | URL count, IP count, registry keys, paths, entropy |
| Packer indicators | packed flag, high-entropy + low-imports, name anomalies, overlay size |
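The groups above can be written out as an explicit 24-name feature list. The exact per-group split and the individual names below are illustrative assumptions; the authoritative definitions live in `features/feature_names.py`.

```python
# Hypothetical 24-feature split matching the groups in the table above.
FEATURE_GROUPS = {
    "header": ["machine_type", "subsystem", "dll_characteristics",
               "timestamp", "timestamp_zero_flag", "header_size"],
    "sections": ["section_count", "entropy_mean", "entropy_max",
                 "size_mean", "size_max"],
    "imports": ["total_dlls", "unique_dlls", "suspicious_import_count",
                "import_hash_entropy"],
    "strings": ["url_count", "ip_count", "registry_key_count",
                "path_count", "string_entropy"],
    "packer": ["packed_flag", "high_entropy_low_imports",
               "section_name_anomaly", "overlay_size"],
}

# Flatten into the ordered vector the classifier consumes.
FEATURE_NAMES = [name for group in FEATURE_GROUPS.values() for name in group]
assert len(FEATURE_NAMES) == 24  # the model expects a 24-dim vector
```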

## Adversarial evasion

The eval harness applies four feature-space attacks to malware samples:

| Attack | What it does |
| --- | --- |
| Entropy padding | Adds benign data to lower section entropy |
| Import obfuscation | Removes the "suspicious imports" features |
| Section renaming | Clears the section-name anomaly flag |
| Timestamp manipulation | Clears the timestamp-zero flag |

A typical run shows that import obfuscation alone collapses detection — which is the interesting research finding the project is designed to surface.
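A minimal sketch of one such feature-space attack and the bypass metric, assuming dict-based feature vectors and a model exposing a 0/1 `predict()`. The feature names here are illustrative assumptions; the real harness lives in `evaluation/adversarial_eval.py`.

```python
import copy

def import_obfuscation(features: dict) -> dict:
    """Simulate an attacker hiding suspicious imports (e.g. by resolving
    them dynamically at runtime): zero the import-derived features."""
    evaded = copy.deepcopy(features)
    evaded["suspicious_import_count"] = 0
    evaded["import_hash_entropy"] = 0.0
    return evaded

def bypass_rate(model, malware_samples, attack) -> float:
    """Fraction of malware samples classified benign after the attack."""
    still_detected = sum(model.predict(attack(s)) for s in malware_samples)
    return 1.0 - still_detected / len(malware_samples)
```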

## Project layout

```
05_malware_classifier/
├── features/
│   ├── feature_names.py
│   └── synthetic_data.py
├── models/
│   ├── train.py
│   └── classifier.pkl              # generated
├── evaluation/
│   ├── evaluate.py
│   └── adversarial_eval.py
├── data/
│   └── README.md                   # how to use real data
├── results/
├── requirements.txt
└── README.md
```

## Setup

```shell
cd 05_malware_classifier

python -m venv .venv
source .venv/bin/activate              # Linux / macOS
# .venv\Scripts\Activate.ps1           # Windows PowerShell

pip install -r requirements.txt
```

## Run it

```shell
# 1. Train the model on the synthetic dataset
python models/train.py

# 2. Run held-out evaluation
python evaluation/evaluate.py

# 3. Run the adversarial-evasion harness
python evaluation/adversarial_eval.py
```

Sample adversarial output:

```
Attack                       Detection  Bypass rate
-----------------------------------------------------
entropy_padding                98.75%       1.25%
import_obfuscation              0.75%      99.25%
section_renaming              100.00%       0.00%
timestamp_manipulation        100.00%       0.00%
```

## Next steps

- Replace the synthetic generator with a real pefile-based feature extractor
- Train on EMBER and re-run the adversarial harness
- Add a defence layer: ensemble with an LLM-based behavioural review
