This repository provides a complete, HPC-oriented research framework for energy-aware knowledge distillation (KD) of large language models. It supports response-based, feature-based, and relation-based KD; integrates GPU/CPU telemetry logging; and measures energy-per-token (EPT), performance retention (OM_perf), and overall training/inference efficiency.
The framework is designed for HPC clusters, Slurm, and NVIDIA GPU environments (H100, A100, RTX).
Energy-Aware-Knowledge-Distillation/
│
├── README.md
├── requirements.txt
├── LICENSE
├── submit.sh
├── monitor.py
│
├── Base/
│ ├── Llama-3.1-70B-Ins_harness.sh
│ ├── Llama-3.1-8B-Ins_harness.sh
│ ├── Qwen2.5-72B-Ins_harness.sh
│ ├── Qwen2.5-7B-Ins_harness.sh
│ └── train_base_from_shards.py
│
├── configs/
│ ├── fb_base.yaml
│ ├── rb_base.yaml
│ └── relb_base.yaml
│
├── data/
│ └── build_shards_from_hf.py
│
├── eval/
│ ├── README.md
│ ├── benchmark_harness/
│ └── lighteval/
│
├── kd/
│ ├── __init__.py
│ ├── dataset.py
│ ├── distillers/
│ │ ├── feature_distiller.py
│ │ ├── response_distiller.py
│ │ └── relation_distiller.py
│ ├── loss_fns.py
│ ├── models.py
│ └── train.py
│
├── notebook/
│ ├── graphs/
│ │ ├── feature_energy_plot.ipynb
│ │ ├── relation_energy_plot.ipynb
│ │ └── response_energy_plot.ipynb
│ └── metrics/
│   ├── EFFoveral.ipynb
│   ├── ENERGYrun.ipynb
│   └── OMperf.ipynb
│
└── scripts/
  ├── _env_single_node.sh
  ├── build_caches.sh
  ├── plot.sh
  ├── run_base_from_shards_single_node.sh
  ├── run_build_shards.sh
  ├── run_eval_lighteval.sh
  └── run_eval_lm_harness.sh
Implements three KD paradigms:
- Response-Based KD: classical teacher-logit matching via cross-entropy.
- Feature-Based KD: intermediate representation alignment (FitNets-style).
- Relation-Based KD: pairwise relational distance preservation (RDL).
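To make the three paradigms concrete, here is a minimal, dependency-free sketch of one loss per paradigm. It is not the code in kd/loss_fns.py. The function names, the temperature default, and the choice of mean squared error for the feature and relation terms are assumptions made for illustration. The feature loss also assumes the student features have already been projected to the teacher's dimension.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def response_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def feature_kd_loss(student_feat, teacher_feat):
    """Mean squared error between same-sized feature vectors (assumed pre-projected)."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(teacher_feat)

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of embeddings in a batch."""
    return [[math.dist(a, b) for b in embeddings] for a in embeddings]

def relation_kd_loss(student_emb, teacher_emb):
    """Mean squared difference between student and teacher distance matrices."""
    d_s = pairwise_distances(student_emb)
    d_t = pairwise_distances(teacher_emb)
    n = len(d_t)
    return sum((d_s[i][j] - d_t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
```

Because the relation loss compares distance matrices rather than raw embeddings, the student and teacher embedding widths do not need to match in this sketch.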
Features include:
- Teacher/student model loading via Hugging Face
- Modular loss functions
- Multi-GPU Slurm compatibility
- Logging, checkpoints, telemetry hooks
The build_shards_from_hf.py script supports:
- Hugging Face dataset loading
- Tokenization + shard creation
- Memory-efficient distributed training
Shards improve:
- I/O performance
- Deterministic sampling
- Multi-node scaling
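The sketch below shows one way tokenized examples can be packed into fixed-size shards and then assigned to ranks reproducibly. It is not the logic of build_shards_from_hf.py. The names `pack_into_shards`, `shards_for_rank`, `tokens_per_shard`, `seed`, and `epoch` are hypothetical, and the input is assumed to be lists of token IDs that some tokenizer has already produced.

```python
import random

def pack_into_shards(token_sequences, tokens_per_shard):
    """Concatenate tokenized examples and split them into fixed-size shards."""
    shards, current = [], []
    for seq in token_sequences:
        for tok in seq:
            current.append(tok)
            if len(current) == tokens_per_shard:
                shards.append(current)
                current = []
    if current:  # keep the final, possibly shorter, shard
        shards.append(current)
    return shards

def shards_for_rank(num_shards, rank, world_size, seed, epoch):
    """Shuffle shard indices with a seeded RNG, then stride them across ranks."""
    order = list(range(num_shards))
    random.Random(seed + epoch).shuffle(order)
    return order[rank::world_size]
```

Seeding the shuffle with `seed + epoch` gives every rank the same permutation without communication, so each shard is read by exactly one rank per epoch and a rerun with the same seed visits shards in the same order.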
The telemetry collector records:
- GPU power (W)
- GPU utilization
- GPU memory usage
- GPU temperature
- CPU usage
- Timestamps
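A collector of this kind can be sketched as a loop that polls a sensor reader and appends one JSON object per sample. This is not the implementation of monitor.py. The function `collect_telemetry`, the field names, and the `fake_reader` stand-in are assumptions. A real reader would query the hardware (for example through NVML) and return the same keys.

```python
import json
import time

def collect_telemetry(read_sample, path, num_samples, interval_s=1.0):
    """Append one JSON object per sample to a JSONL file."""
    with open(path, "a", encoding="utf-8") as f:
        for i in range(num_samples):
            # Timestamp each record so energy can be integrated over time later.
            record = {"timestamp": time.time(), **read_sample()}
            f.write(json.dumps(record) + "\n")
            f.flush()  # keep the log usable if the job is killed mid-run
            if i + 1 < num_samples:
                time.sleep(interval_s)

def fake_reader():
    """Stand-in for a real sensor query; all values are made up."""
    return {
        "gpu_power_w": 250.0,
        "gpu_util_pct": 90,
        "gpu_mem_mb": 40000,
        "gpu_temp_c": 65,
        "cpu_util_pct": 35.0,
    }
```

Passing the reader in as a function keeps the logging loop testable on machines without a GPU.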
The collector outputs JSONL suitable for computing:
- E_run: total energy (J)
- E_avg: average energy per sampling interval (J)
- EPT: energy per token (J/token)
- Eff_overall: combined efficiency metric
Designed to run alongside KD training in Slurm jobs.
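As a rough illustration of how the energy metrics follow from a power log, the sketch below integrates power over time with the trapezoidal rule. The JSONL field names `timestamp` and `gpu_power_w` are assumptions, not the monitor's documented schema, and the helper names are hypothetical. Eff_overall is left out because its definition is not given here.

```python
import json

def load_power_samples(lines):
    """Parse JSONL lines into time-sorted (timestamp_s, power_w) pairs."""
    samples = []
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            samples.append((rec["timestamp"], rec["gpu_power_w"]))
    return sorted(samples)

def total_energy_joules(samples):
    """E_run: trapezoidal integral of power (W) over time (s), in joules."""
    return sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )

def average_interval_energy(samples):
    """E_avg: mean energy per sampling interval, in joules."""
    return total_energy_joules(samples) / (len(samples) - 1)

def energy_per_token(e_run_joules, num_tokens):
    """EPT: joules spent per processed token."""
    return e_run_joules / num_tokens
```

For example, three samples at 0 s, 1 s, and 2 s reading 100 W, 100 W, and 200 W integrate to 100 J + 150 J = 250 J, so a run that processed 500 tokens in that window would have an EPT of 0.5 J/token.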
The Base/ directory contains Slurm-ready harness scripts for baseline training of:
- Llama-3.1-70B-Instruct
- Llama-3.1-8B-Instruct
- Qwen2.5-72B-Instruct
- Qwen2.5-7B-Instruct
These baselines support comparison against KD-student models for:
- OM_perf retention
- Energy-per-token improvements
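The two comparisons can be sketched as simple ratios. The exact definition of OM_perf is not spelled out here, so the formula below (mean of per-benchmark student/teacher score ratios) is an assumption, as are the function names. Treat it as a placeholder for whatever the metrics notebooks actually compute.

```python
def performance_retention(student_scores, teacher_scores):
    """Mean of per-benchmark student/teacher score ratios (assumed definition)."""
    ratios = [student_scores[name] / teacher_scores[name] for name in teacher_scores]
    return sum(ratios) / len(ratios)

def ept_reduction(student_ept, baseline_ept):
    """Fractional drop in energy per token relative to the baseline."""
    return 1.0 - student_ept / baseline_ept
```

With purely illustrative numbers, a student scoring 0.5 and 0.4 where the teacher scores 1.0 and 0.8 retains 0.5 of the teacher's performance under this definition, and a student EPT of 0.25 J/token against a baseline of 1.0 J/token is a 0.75 (75%) reduction.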
The LM evaluation harness setup (eval/benchmark_harness/) evaluates on standard benchmarks:
- MMLU
- ARC
- TruthfulQA
- GSM8K
- HellaSwag
The lighteval setup (eval/lighteval/) provides a fast, lightweight evaluator for iterative KD loops.
Outputs feed into notebooks analyzing:
- OM_perf
- Energy profiles
- Accuracy retention
The notebook/ directory includes Jupyter notebooks for:
- Feature KD energy plots
- Response KD energy plots
- Relation KD energy plots
- OM_perf
- ENERGYrun
- EFFoverall
These notebooks provide visualizations for research publications.
The scripts/ directory automates:
- Environment setup
- Dataset caching/sharding
- KD training
- LM-harness evaluation
- Plotting pipelines
All scripts are Slurm-friendly and optimized for multi-GPU nodes.
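Install dependencies:
pip install -r requirements.txt
Build the dataset shards:
bash scripts/run_build_shards.sh
Run a baseline harness (here Llama-3.1-8B-Instruct):
bash Base/Llama-3.1-8B-Ins_harness.sh
Run KD training with one of the configs in configs/:
python kd/train.py --config configs/rb_base.yaml
Log telemetry (intended to run alongside KD training):
python monitor.py --output telemetry.jsonl
Evaluate with the LM harness:
bash scripts/run_eval_lm_harness.sh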
Open the notebooks in:
notebook/graphs/
notebook/metrics/
This framework enables analysis of:
- Energy efficiency of KD methods
- Scaling behavior on HPC systems
- Student-vs-teacher energy/performance retention
- Energy-per-token reduction from KD
- KD paradigm comparisons (response/feature/relation)
Suitable for SC, CCGrid, NeurIPS, and systems/ML research.
See LICENSE in the repository.