RusaUB/malware-detection

Malware Detection

A lightweight, configurable CLI for malware detection on the CIC-MalMem-2022 dataset. It loads the dataset, preprocesses features, trains classic ML models for malware detection (binary) and family classification (multi-class), and visualizes the results. A separate entry point, malimg.py, handles image-based malware classification (malimg).

  • Loads dataset from Hugging Face Datasets and caches locally
  • Preprocesses features, derives category_family and category_encoded
  • Trains and evaluates Random Forest, MLP, KNN, XGBoost
  • Saves trained models under a versioned directory structure
  • Generates Sankey and radar plots, plus confusion matrices for trained models

Quick Start

# 1) Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 2) Install dependencies
pip install -r requirements.txt

# 3) Preview the data (downloads and caches the dataset on first run)
python main.py --view head -r 10

# 4) Train models
# Binary detection (benign vs. malware)
python main.py --train detection
# Multi-class classification of malware family
python main.py --train classification
# Train both
python main.py --train both

# 5) Visualize
# Sankey of Category distribution
python main.py --plot sankey
# Radar of selected numeric features per Category
python main.py --plot radar -x handles.nsemaphore handles.nmutant -C Benign
# Confusion matrix for a saved model
python main.py --plot confusion --confusion-task classification --model-name XGBoost

If you encounter download issues (e.g., no internet), see “Dataset & Caching”.

Project Structure

.
├── main.py                # CLI entrypoint
├── config.yaml            # Project configuration (logging, data, models)
├── requirements.txt
├── malimg.py              # CLI entrypoint for malware image classification
├── malimg.yaml            # Configuration for malimg.py
├── malimg/                # Malware imaging related files
│   └── dataset/
│       └── malimg.npz     # Dataset for malimg
├── src/
│   ├── __init__.py
│   ├── config.py          # YAML loader with basic validation
│   ├── data.py            # HF dataset loader + preprocessing
│   ├── models.py          # Training/eval + model persistence
│   └── plot.py            # Sankey, radar, and confusion matrix plots
└── models/                # Created at runtime, stores saved models

Malware Imaging (malimg)

The malimg component provides functionality for malware image classification using Convolutional Neural Networks (CNNs). It allows for loading and preprocessing image-based malware datasets, training a CNN model, and evaluating its performance.

Usage

To run the malware imaging classification:

python malimg.py --help

This will display the available command-line arguments for malimg.py, including options for data paths, training epochs, batch size, and model saving.

Configuration

The behavior of malimg.py is controlled by malimg.yaml, which specifies parameters such as:

  • training: Epochs, batch size, test split size, and random seed.
  • paths: Locations for the dataset (malimg/dataset/malimg.npz) and saved models (malimg/models).
  • mappings: A dictionary mapping numerical labels to malware family names.

Workflow

  1. Data Loading and Preprocessing: Loads image data from malimg/dataset/malimg.npz, resizes images to 32x32 pixels, flattens them, and applies LabelEncoder to the labels.
  2. Model Building and Training: Constructs a CNN model for One-vs-All classification and trains it using the preprocessed image data.
  3. Model Evaluation: Evaluates the trained model on a test set and reports metrics such as accuracy, precision, recall, and F1-score.
  4. Model Persistence: Saves the trained CNN model to malimg/models/cnn_ova.keras for future use.
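
The preprocessing in step 1 can be illustrated with a minimal NumPy-only sketch. It is not the code in malimg.py: the helper names resize_nearest and encode_labels, the nearest-neighbour resizing method, the synthetic images, and the placeholder family names are all assumptions made for illustration. np.unique is used here as a stand-in for LabelEncoder, since both assign integer codes in sorted class order.

```python
import numpy as np


def resize_nearest(img, size=32):
    """Nearest-neighbour resize of a 2-D grayscale image to size x size."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]


def encode_labels(labels):
    """Map labels to integer codes in sorted class order."""
    classes, encoded = np.unique(labels, return_inverse=True)
    return classes, encoded


# Synthetic stand-ins for the arrays stored in malimg.npz (hypothetical data).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=shape) for shape in [(64, 48), (100, 100), (32, 32)]]
labels = ["FamilyB", "FamilyA", "FamilyB"]  # placeholder family names

# Resize each image to 32x32, then flatten it to a 1024-element vector.
X = np.stack([resize_nearest(img).ravel() for img in images])
classes, y = encode_labels(labels)
```

Each row of X holds one flattened image, and y holds one integer code per sample.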

Configuration

All behavior of main.py is controlled via config.yaml (malimg.py uses malimg.yaml, described above).

logging:
  save: True
  dir: "logs/app.log"     # file path or directory; file created if suffix is present

data:
  dataset: "bvk/CIC-MalMem-2022"
  cache_dir: "data/"      # HF Datasets cache for this project
  selected_features:       # optional: normalize and keep only these
    - handles.nhandles
    - dlllist.ndlls
  selected_features_radar: # optional: which features to show on radar plot
    - svcscan.nservices
    - malfind.ninjections

models:
  target_col: "category_encoded"   # used for classification
  dirs:
    root: "models"
    binary: "binary"
    classification: "classification"
  params:                # per-model defaults (overrides are merged)
    Random Forest:
      n_estimators: 100
      random_state: 42
      n_jobs: -1
    MLP:
      hidden_layer_sizes: [100, 50]
      max_iter: 1000
      random_state: 42
    KNN:
      n_neighbors: 5
    XGBoost:
      n_estimators: 100
      random_state: 42
      verbosity: 0
      n_jobs: -1

Notes:

  • logging.dir can be a directory or a file path. If a filename is provided (has a suffix), logs go to that file; otherwise, app.log will be created inside the directory.
  • data.selected_features controls which numeric columns are standardized and kept after preprocessing. If omitted, all numeric features (excluding labels) are standardized and kept.
  • selected_features_radar filters which numeric features appear on the radar plot.
  • Model hyperparameters in models.params are merged with sensible defaults in code.
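
Two of the notes above describe small rules that are easy to express in code. The following is a minimal sketch, not the project's implementation in src/config.py or src/models.py: the function names resolve_log_path and merge_params and the example defaults are hypothetical.

```python
from pathlib import Path


def resolve_log_path(log_dir, default_name="app.log"):
    """Return the log file path: the path itself if it has a suffix,
    otherwise default_name inside the given directory."""
    path = Path(log_dir)
    return path if path.suffix else path / default_name


def merge_params(defaults, overrides):
    """Merge per-model overrides (e.g. from models.params) into code defaults."""
    merged = {name: dict(params) for name, params in defaults.items()}
    for name, params in (overrides or {}).items():
        merged.setdefault(name, {}).update(params)
    return merged


# Hypothetical in-code defaults and a config override for one model.
code_defaults = {"KNN": {"n_neighbors": 5, "weights": "uniform"}}
config_params = {"KNN": {"n_neighbors": 7}}
knn_params = merge_params(code_defaults, config_params)["KNN"]
```

With these inputs, the override replaces n_neighbors while the untouched default for weights is preserved.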

How It Works

  • Data loading (src/data.py)

    • Uses datasets.load_dataset("bvk/CIC-MalMem-2022") and converts the train split to a pandas DataFrame.
    • Derives category_family from Category by taking the substring before the first dash (e.g., “Ransomware-Foo” → “Ransomware”).
    • Encodes category_family into category_encoded via LabelEncoder.
    • If Class is a string column ("Malware"/"Benign"), it is converted to 1/0.
    • Standardizes numeric features (z-score) either on all numeric columns or only on data.selected_features if provided.
  • Training (src/models.py)

    • Binary detection uses Class (0/1) as the target.
    • Multi-class classification uses category_encoded as the target (ensure preprocessing ran).
    • Splits data with stratified train/test (--test-size, --seed).
    • Trains: Random Forest, MLP, KNN, XGBoost; reports Accuracy, Precision, Recall, F1.
    • Saves each model to models/<task>/<name>.joblib (skips retraining if the saved artifact exists and features match).
    • Optional hyperparameter tuning for classification (--tune-classification) via GridSearchCV with stratified 3‑fold.
  • Plotting (src/plot.py)

    • sankey: shows distribution of Category values.
    • radar: per-category mean values across selected numeric features (can --exclude features and --exclude-category).
    • confusion: loads a saved model and plots a confusion matrix on the test split. For detection, if a saved model is missing, it will train a fresh model on the fly; for classification, you must train first.
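
The data loading steps listed above can be sketched with pandas and scikit-learn. This is an illustration under assumptions rather than the code in src/data.py: the function name preprocess and the four-row demo frame are hypothetical, and the real dataset has many more columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocess(df, selected_features=None):
    out = df.copy()
    # "Ransomware-Foo" -> "Ransomware": keep the text before the first dash.
    out["category_family"] = out["Category"].str.split("-", n=1).str[0]
    out["category_encoded"] = LabelEncoder().fit_transform(out["category_family"])
    # Convert string Class labels to 1/0; leave numeric labels untouched.
    if not pd.api.types.is_numeric_dtype(out["Class"]):
        out["Class"] = out["Class"].map({"Malware": 1, "Benign": 0})
    label_cols = {"Class", "category_encoded"}
    numeric = [c for c in out.select_dtypes("number").columns if c not in label_cols]
    features = list(selected_features) if selected_features else numeric
    # Drop numeric columns that were not selected, then z-score the rest.
    out = out.drop(columns=[c for c in numeric if c not in features])
    out[features] = StandardScaler().fit_transform(out[features])
    return out


# Hypothetical demo rows in the shape the text describes.
demo = pd.DataFrame({
    "Category": ["Benign", "Ransomware-Foo-1", "Spyware-Bar-2", "Ransomware-Baz-3"],
    "Class": ["Benign", "Malware", "Malware", "Malware"],
    "handles.nhandles": [10.0, 20.0, 30.0, 40.0],
    "dlllist.ndlls": [1.0, 2.0, 3.0, 4.0],
})
result = preprocess(demo, selected_features=["handles.nhandles"])
```

Calling preprocess(demo) without selected_features standardizes and keeps both numeric columns instead.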

CLI Usage

General options:

python main.py --help

Viewing data:

# Show first 5 rows (default)
python main.py --view head
# Show a random sample of 10
python main.py --view sample --rows 10

Training:

# Detection (binary)
python main.py --train detection
# Classification (multi-class)
python main.py --train classification
# Train both
python main.py --train both

# Choose models directory (else taken from config)
python main.py --train detection --models-dir ./models

# Control split and seed
python main.py --train classification --test-size 0.25 --seed 123

# Enable hyperparameter tuning (classification only)
python main.py --train classification --tune-classification
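
For orientation, the stratified split and the tuning step (GridSearchCV with stratified 3-fold) look roughly like this in scikit-learn. It is a sketch under assumptions: the synthetic data, the KNN estimator, the parameter grid, and the f1_macro scoring are chosen for illustration and are not taken from src/models.py.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the preprocessed features.
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=123,
)

# Stratified split, mirroring --test-size 0.25 --seed 123.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123
)

param_grid = {"n_neighbors": [3, 5, 7]}  # hypothetical grid
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="f1_macro")
search.fit(X_train, y_train)

best_params = search.best_params_
test_score = search.score(X_test, y_test)  # macro F1 of the refit best model
```

Stratifying both the split and the folds keeps class proportions stable, which matters when some malware families are rarer than others.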

Plotting:

# Sankey of Category distribution
python main.py --plot sankey

# Radar chart (exclude specific features and/or categories)
python main.py --plot radar --exclude handles.nsemaphore handles.nmutant --exclude-category Benign

# Confusion matrix (requires a saved model; train first for classification)
python main.py --plot confusion --confusion-task classification --model-name XGBoost
# For detection you can also visualize without prior save (auto-trains if missing)
python main.py --plot confusion --confusion-task detection --model-name "Random Forest"
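
A model name such as Random Forest contains a space, so a shell passes it as two arguments unless it is quoted. The sketch below shows the difference using shlex and argparse from the standard library. It assumes --model-name takes a single value, which may not match how main.py actually defines the option.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
parser.add_argument("--model-name")  # assumed to take exactly one value

# Quoted: the shell-style split keeps the name as one argument.
quoted = shlex.split('--model-name "Random Forest"')
args = parser.parse_args(quoted)

# Unquoted: the name is split, and the second word is left over.
unquoted = shlex.split("--model-name Random Forest")
known, extra = parser.parse_known_args(unquoted)
```

Here args.model_name is the full name, while the unquoted form yields only the first word and leaves the second unparsed.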

Dataset & Caching

  • The first run will download bvk/CIC-MalMem-2022 from Hugging Face and cache it.
  • Cache location is controlled by data.cache_dir in config.yaml. You can also rely on standard HF cache env vars like HF_HOME or HF_DATASETS_CACHE if you prefer.
  • If running in a restricted/offline environment, pre-download the dataset on a machine with internet and copy the cache directory to this project’s data/ folder.
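
Loading with an explicit cache directory can be sketched as follows. The function name load_malmem is hypothetical, and the import is done lazily so the module can be imported without the datasets package installed. load_dataset, its cache_dir parameter, and Dataset.to_pandas are part of the Hugging Face datasets library.

```python
DATASET_ID = "bvk/CIC-MalMem-2022"


def load_malmem(cache_dir="data/"):
    """Load the train split and return it as a pandas DataFrame.

    The first call downloads the dataset into cache_dir; later calls reuse
    the cached copy.
    """
    from datasets import load_dataset  # lazy import of the HF datasets library

    dataset = load_dataset(DATASET_ID, split="train", cache_dir=cache_dir)
    return dataset.to_pandas()
```

In an offline environment with a pre-populated cache, setting the HF_DATASETS_OFFLINE=1 environment variable should make the library use the cached copy without contacting the Hub.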

Logging & Outputs

  • Console logs are always enabled.
  • If logging.save: True, file logs are written to the path configured by logging.dir.
  • Trained models are saved under models/ by default:
    • models/binary/<model>.joblib
    • models/classification/<model>.joblib
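
The layout above follows from the models.dirs entries in config.yaml. A minimal sketch of how a path could be built from them is shown below. The function name model_path is hypothetical, and PurePosixPath is used only so the output looks the same on every platform.

```python
from pathlib import PurePosixPath

# Mirrors models.dirs in config.yaml.
MODEL_DIRS = {"root": "models", "binary": "binary", "classification": "classification"}


def model_path(task, model_name, dirs=MODEL_DIRS):
    """Build models/<task dir>/<model>.joblib for a task and model name."""
    if task == "detection":
        subdir = dirs["binary"]
    elif task == "classification":
        subdir = dirs["classification"]
    else:
        raise ValueError(f"unknown task: {task!r}")
    return PurePosixPath(dirs["root"]) / subdir / f"{model_name}.joblib"


detection_path = model_path("detection", "KNN")
classification_path = model_path("classification", "XGBoost")
```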

Requirements

  • Python 3.10+
  • See requirements.txt for exact versions. Notable dependencies: datasets, pandas, scikit-learn, xgboost, matplotlib, plotly, seaborn.

Tips & Troubleshooting

  • Classification requires category_encoded; ensure preprocessing ran (it does in main.py).
  • If you change the feature set, a saved model is reused only when its features still match; otherwise it is retrained.
  • If --plot confusion fails for classification because the saved model is missing, run classification training first (python main.py --train classification).
  • XGBoost objective is set automatically to multi-class when necessary.
