Skip to content

MATTKYang/PolicySecDistill

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PolicySecDistill

Policy-Aware Security Distillation for Small Agent Models.

This project explores methods to distill security-aligned behaviors into smaller language models, making them resistant to prompt injection attacks while preserving task utility. Built on top of the SecAlign framework.

Methods

  • SecAlign (DPO): Preference optimization using DPO/KTO/ORPO on prompt-injected inputs, teaching the model to prefer secure outputs over insecure ones.
  • GRPO: Group Relative Policy Optimization with custom reward functions that balance attack resistance, task success, and output quality.

Reward Design (GRPO)

The GRPO reward combines three components:

  • reward_attack_resistance: Penalizes outputs containing injected keywords (weight: 2.0)
  • reward_task_success: Token-level F1 against reference output (weight: 1.0)
  • reward_length_penalty: Gaussian penalty for length deviation from reference (weight: 0.5)

Supported Models

  • Qwen2.5-3B-Instruct
  • Meta-Llama-3-8B-Instruct
  • Mistral-7B-Instruct-v0.1
  • LLaMA-7B (SFT + alignment)

Setup

conda create -n secalign python==3.10
conda activate secalign
pip install -r requirements.txt
python setup.py

Training

SecAlign (DPO)

sbatch secalign_qwen.slurm
# or manually:
python align.py \
    --model_name_or_path Qwen/Qwen2.5-3B-Instruct \
    --data_path data/alpaca_data_cleaned.json \
    --attack NaiveCompletion \
    --alignment dpo \
    --num_train_epochs 3

GRPO

sbatch grpo_qwen.slurm
# or manually:
python grpo_train.py \
    --model_name_or_path Qwen/Qwen2.5-3B-Instruct \
    --data_path data/alpaca_data_cleaned.json \
    --attack NaiveCompletion \
    --K 4 --mini_epochs 4 \
    --num_epochs 3

Testing

sbatch test_qwen.slurm
# or use run.py for automated test orchestration
python run.py --do_test

Project Structure

├── align.py            # SecAlign preference optimization (DPO/KTO/ORPO)
├── grpo_train.py       # GRPO training loop with custom rewards
├── train.py            # SFT training
├── test.py             # Evaluation (utility + attack success rate)
├── config.py           # Delimiters, prompt formats, model configs
├── struq.py            # StruQ structured query defense
├── run.py              # Automated training/testing orchestration
├── setup.py            # Data & model download
├── scripts/            # Shell scripts for different training configs
├── *.slurm             # SLURM job scripts for cluster training
└── README_SecAlign.md  # Original SecAlign README

Acknowledgments

This project builds on SecAlign (Chen et al., CCS'25).

About

Policy-Aware Security Distillation for Small Agent Models

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors