Policy-Aware Security Distillation for Small Agent Models
This project explores methods for distilling security-aligned behaviors into smaller language models, making them resistant to prompt injection attacks while preserving task utility. It is built on top of the SecAlign framework.
- SecAlign (DPO): Preference optimization using DPO/KTO/ORPO on prompt-injected inputs, teaching the model to prefer secure outputs over insecure ones.
- GRPO: Group Relative Policy Optimization with custom reward functions that balance attack resistance, task success, and output quality.
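The "group relative" part of GRPO can be illustrated with a minimal sketch: several completions are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages` and the `eps` term are illustrative assumptions, not taken from `grpo_train.py`; the group size of 4 simply mirrors the `--K 4` flag used in the training command.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: the repository's own normalization may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions, each already scored by the reward function.
advantages = group_relative_advantages([2.5, 0.5, 1.5, -0.5])
```

Completions that score above their group's average receive a positive advantage and are reinforced; those below average are discouraged, with no separate value model required.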
The GRPO reward combines three components:
- reward_attack_resistance: Penalizes outputs containing injected keywords (weight: 2.0)
- reward_task_success: Token-level F1 against reference output (weight: 1.0)
- reward_length_penalty: Gaussian penalty for length deviation from reference (weight: 0.5)
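As a rough sketch of how the three components above might be combined into a single scalar, the example below uses the function names and weights listed here, but the function bodies are assumptions: the whitespace tokenization, the -1.0 keyword penalty, the `rel_sigma` width of the Gaussian, and the `combined_reward` helper are all illustrative and may not match `grpo_train.py`.

```python
import math
from collections import Counter

# Weights as listed above.
W_ATTACK, W_TASK, W_LENGTH = 2.0, 1.0, 0.5


def reward_attack_resistance(output, injected_keywords):
    # Assumption: -1.0 if any injected keyword appears (case-insensitive), else 0.0.
    text = output.lower()
    return -1.0 if any(k.lower() in text for k in injected_keywords) else 0.0


def reward_task_success(output, reference):
    # Token-level F1 over whitespace tokens (the tokenization is an assumption).
    pred, ref = output.split(), reference.split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward_length_penalty(output, reference, rel_sigma=0.5):
    # Gaussian in the length deviation: 0.0 at equal length, approaching -1.0
    # as the output length drifts away from the reference length.
    ref_len = max(len(reference.split()), 1)
    deviation = (len(output.split()) - ref_len) / (rel_sigma * ref_len)
    return math.exp(-0.5 * deviation ** 2) - 1.0


def combined_reward(output, reference, injected_keywords):
    # Hypothetical helper: weighted sum of the three components.
    return (W_ATTACK * reward_attack_resistance(output, injected_keywords)
            + W_TASK * reward_task_success(output, reference)
            + W_LENGTH * reward_length_penalty(output, reference))
```

With these assumed definitions, an output that matches the reference and ignores the injection scores highest, while an output that follows the injected instruction is dominated by the attack-resistance term because of its larger weight.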
- Qwen2.5-3B-Instruct
- Meta-Llama-3-8B-Instruct
- Mistral-7B-Instruct-v0.1
- LLaMA-7B (SFT + alignment)
conda create -n secalign python==3.10
conda activate secalign
pip install -r requirements.txt
python setup.py

sbatch secalign_qwen.slurm
# or manually:
python align.py \
--model_name_or_path Qwen/Qwen2.5-3B-Instruct \
--data_path data/alpaca_data_cleaned.json \
--attack NaiveCompletion \
--alignment dpo \
--num_train_epochs 3

sbatch grpo_qwen.slurm
# or manually:
python grpo_train.py \
--model_name_or_path Qwen/Qwen2.5-3B-Instruct \
--data_path data/alpaca_data_cleaned.json \
--attack NaiveCompletion \
--K 4 --mini_epochs 4 \
--num_epochs 3

sbatch test_qwen.slurm
# or use run.py for automated test orchestration
python run.py --do_test

├── align.py # SecAlign preference optimization (DPO/KTO/ORPO)
├── grpo_train.py # GRPO training loop with custom rewards
├── train.py # SFT training
├── test.py # Evaluation (utility + attack success rate)
├── config.py # Delimiters, prompt formats, model configs
├── struq.py # StruQ structured query defense
├── run.py # Automated training/testing orchestration
├── setup.py # Data & model download
├── scripts/ # Shell scripts for different training configs
├── *.slurm # SLURM job scripts for cluster training
└── README_SecAlign.md # Original SecAlign README
This project builds on SecAlign (Chen et al., CCS'25).