TalkingJupiter/Slimming-Models-Saving-Watts

Slimming Models, Saving Watts

Welcome. This repository is an HPC-oriented research framework for studying how knowledge distillation (KD) changes the task performance and energy cost of large language models. It covers the full experiment lifecycle: preparing data, generating reusable teacher signals, training student variants, collecting telemetry, and comparing model quality against energy per generated token.

The code targets Slurm clusters with NVIDIA GPUs. Current scheduler profiles support H100 nodes on REPACSS and A100 nodes on HPCC.

What this project compares

The repository supports four student-training strategies:

  • Response-based KD (rb) learns from cached teacher top-k logits.
  • Feature-based KD (fb) aligns selected teacher and student hidden states.
  • Relation-based KD (relb) preserves relationships between teacher representations.
  • Traditional supervised fine-tuning (SFT) trains the student without KD and serves as the control.

Every training path can collect GPU/CPU telemetry. Trained models are evaluated with:

  • LM Evaluation Harness for task quality on MMLU, HellaSwag, BBH, and ARC-Challenge.
  • EPT-Bench for energy, throughput, latency, and energy per token across a generation-length sweep.

Workflow at a glance

Hugging Face datasets
        |
        v
data/shards.jsonl -------------------------> traditional SFT
        |
        v
warm teacher and student model snapshots
        |
        v
teacher caches
  +-> top-k logits ------------------------> response KD
  +-> hidden states -----------------------> feature KD
  +-> embeddings --------------------------> relation KD
                                                   |
                              +--------------------+--------------------+
                              |                                         |
                              v                                         v
                    LM Harness evaluation                    EPT energy profiling

Teacher inference is performed once during cache generation. KD training reads the resulting Parquet shards instead of keeping the teacher resident beside the student, reducing repeated teacher compute and making methods easier to compare.

Repository layout

configs/                  KD configuration examples
data/                     dataset mixing and JSONL shard generation
eval/                     LM Harness and EPT evaluation
kd/                       KD datasets, losses, models, and training
scripts/                  environment, cache, training, and submission helpers
teacher_farm/             teacher-signal cache builders
traditional-model/        supervised fine-tuning baseline
job_title.sh              Slurm resources and job-kind mapping
submit_a100.sh            A100 training and harness launcher
submit_h100.sh            H100 harness launcher and dependency template
monitor.py                training/cache telemetry collector
results/                  generated telemetry and evaluation outputs
serialization_dir/        generated distilled PEFT checkpoints
traditional_student/      generated traditional SFT checkpoints

Superseded launchers are retained under retired/ for reference.

Prerequisites

You need Linux, Slurm, Conda, NVIDIA GPUs with CUDA/NVML, and access to the configured Hugging Face models and datasets. Gated models require a Hugging Face token.

The shared environment helper creates or activates a Python 3.10 Conda environment named kd, installs the packages from requirements.txt the first time it creates that environment, and configures project-local Hugging Face caches:

source scripts/_env_single_node.sh

Runtime cache data is stored in .hf_cache/; complete warmed model snapshots are stored in .hf_models/. These locations can be overridden for a cluster's preferred scratch filesystem.

The configured Slurm partitions, REPACSS reservation/node selection, and the CUDA path in the traditional SFT launcher are site-specific. Review job_title.sh and traditional-model/slurm/run_sft.sh before using another cluster.

The KD launchers also expect a DeepSpeed configuration at configs/ds_zero3.json. Supply a cluster-appropriate ZeRO-3 configuration at that path before starting a KD training job.
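
As a starting point, the sketch below writes a minimal ZeRO-3 file to that path, overwriting any existing file. The keys are standard DeepSpeed options, but the values are assumptions rather than this project's settings: the "auto" placeholders rely on the Accelerate/Transformers integration filling them in, and DS_CONFIG is a variable introduced here only for illustration.

```shell
# Illustrative only: DS_CONFIG is a hypothetical override, not a project variable.
DS_CONFIG="${DS_CONFIG:-configs/ds_zero3.json}"
mkdir -p "$(dirname "$DS_CONFIG")"

# Minimal ZeRO stage 3 settings; every value is an assumption to adapt per cluster.
cat > "$DS_CONFIG" <<'EOF'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
EOF
echo "wrote $DS_CONFIG"
```

Offloading, bucket sizes, and precision usually need tuning per node type, so treat this file as a template rather than a recommended configuration.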

Models and shared settings

Launchers are configured primarily through environment variables:

export TEACHER="meta-llama/Meta-Llama-3.1-70B"
export STUDENT="meta-llama/Meta-Llama-3.1-8B"
export TEACHER_DATA="${TEACHER//\//_}"

TEACHER_DATA names the cache directory under data/. Keep it consistent between cache generation and KD training.
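
For reference, ${TEACHER//\//_} is Bash pattern substitution that replaces every / in the model ID with _. The sketch below reproduces the same result with tr so that it also runs in a plain POSIX shell; safe_name is a helper name made up for this example.

```shell
# Hypothetical helper: same result as Bash's ${1//\//_}, written portably.
safe_name() {
  printf '%s\n' "$1" | tr '/' '_'
}

TEACHER="meta-llama/Meta-Llama-3.1-70B"
TEACHER_DATA="$(safe_name "$TEACHER")"
echo "data/${TEACHER_DATA}"   # prints data/meta-llama_Meta-Llama-3.1-70B
```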

General workflow

1. Build the training corpus

scripts/run_build_shards.sh mixes one or more Hugging Face datasets into the shared JSONL input:

HF_DATASETS="allenai/tulu-3-sft-mixture,HuggingFaceFW/fineweb-edu" \
WEIGHTS="3,1" \
MAX_SAMPLES=300000 \
STREAMING=1 \
OUT=data/shards.jsonl \
bash scripts/run_build_shards.sh

The cache builders and traditional SFT baseline consume this same file.
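
To get a feel for the mix requested by the command above, the sketch below converts WEIGHTS into expected per-dataset sample counts. It assumes the weights are relative and normalized over their sum; the mixing script may sample stochastically or round differently, and expected_counts is a helper invented for this example.

```shell
# Hypothetical helper: expected samples per dataset, assuming normalized weights.
# $1 = comma-separated weights, $2 = MAX_SAMPLES
expected_counts() {
  echo "$1" | awk -F, -v max="$2" '{
    for (i = 1; i <= NF; i++) total += $i
    for (i = 1; i <= NF; i++)
      printf "%d%s", int(max * $i / total), (i < NF ? " " : "\n")
  }'
}

expected_counts "3,1" 300000   # prints 225000 75000
```

Under that assumption, a 3:1 mix with MAX_SAMPLES=300000 draws roughly three quarters of the corpus from the first dataset.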

2. Warm model snapshots

Cache jobs default to local-only loading, so warm the teacher and student before submitting cache arrays:

TEACHER="$TEACHER" STUDENT="$STUDENT" bash scripts/warm_hf_cache.sh

resolve_hf_model automatically prefers snapshots under .hf_models/<safe_model_name>/.
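
The lookup can be pictured roughly as follows. This is a stand-in written for illustration, not the project's resolve_hf_model: the function name, the rule that maps a model ID to <safe_model_name> (slashes to underscores), and the fallback to the hub ID are all assumptions.

```shell
# Hypothetical stand-in for resolve_hf_model; the real helper may differ.
# $1 = Hugging Face model ID, $2 = snapshot root (defaults to .hf_models)
resolve_model_sketch() {
  model_id="$1"
  models_root="${2:-.hf_models}"
  safe="$(printf '%s' "$model_id" | tr '/' '_')"   # assumed naming rule
  if [ -d "$models_root/$safe" ]; then
    printf '%s\n' "$models_root/$safe"   # warmed local snapshot found
  else
    printf '%s\n' "$model_id"            # otherwise fall back to the model ID
  fi
}
```

Calling resolve_model_sketch "$TEACHER" before submitting a cache array is a quick way to confirm that warming produced a local snapshot.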

3. Generate teacher caches

Each Slurm array task reads part of data/shards.jsonl, writes Parquet shards, and records cache-generation telemetry.

# H100 / REPACSS
bash job_title.sh repacss build_feature_cache
bash job_title.sh repacss build_relation_cache
bash job_title.sh repacss build_response_cache

# A100 / HPCC
bash job_title.sh hpcc build_feature_cache
bash job_title.sh hpcc build_relation_cache
bash job_title.sh hpcc build_response_cache

Outputs are organized by signal type:

data/<teacher>/fb_hints_L<teacher_layer>/*.parquet
data/<teacher>/relb_embeds/*.parquet
data/<teacher>/topk_k16/*.parquet

Feature KD chooses teacher/student layers from model depth using FEATURE_LAYER_RATIO (default 0.60). The resolved mapping is saved in feature_layer_map.json. Explicit FEATURE_TEACHER_LAYER and FEATURE_STUDENT_LAYER values override the ratio.
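
As a worked example of the ratio, the sketch below computes a layer index as depth times ratio, rounded to the nearest integer. The rounding rule, the indexing base, and the layer_for_ratio name are assumptions; feature_layer_map.json remains the authority on what was actually used. The depths are the published layer counts for Llama 3.1 70B (80) and 8B (32).

```shell
# Sketch: layer index = round(depth * ratio). The rounding rule is an assumption.
layer_for_ratio() {
  awk -v depth="$1" -v ratio="$2" 'BEGIN { printf "%d\n", int(depth * ratio + 0.5) }'
}

teacher_layer="$(layer_for_ratio 80 0.60)"   # Llama 3.1 70B: 80 layers
student_layer="$(layer_for_ratio 32 0.60)"   # Llama 3.1 8B: 32 layers
echo "teacher=L${teacher_layer} student=L${student_layer}"   # prints teacher=L48 student=L19
```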

4. Train student variants

The A100 convenience launcher exposes each active training mode:

bash submit_a100.sh feature
bash submit_a100.sh relation
bash submit_a100.sh response
bash submit_a100.sh traditional

For H100, call job_title.sh, the canonical resource mapper, directly:

bash job_title.sh repacss feature
bash job_title.sh repacss relation
bash job_title.sh repacss response
bash job_title.sh repacss traditional

The KD launchers use LoRA, Accelerate, and DeepSpeed and save adapters under:

serialization_dir/<student>/<feature|relation|response>/<array_task_id>/

Traditional SFT saves full checkpoints under:

traditional_student/<student>/<array_task_id>/

Training telemetry is stored under results/<student>/<method>/<array_task_id>/. KD launchers use automatic resume where supported.

5. Evaluate task quality

Harness evaluation discovers checkpoints by method, warms the shared lm-eval dataset cache, and creates one task per checkpoint x repeat x benchmark combination.

# One method
bash submit_h100.sh harness_feature
bash submit_h100.sh harness_relation
bash submit_h100.sh harness_response
bash submit_h100.sh harness_traditional

# All H100 harness families, including base student and teacher
bash submit_h100.sh harness_all

# A100 distilled/traditional families
bash submit_a100.sh harness_all
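
To make the checkpoint x repeat x benchmark fan-out concrete, the sketch below decodes a flat task index into its three coordinates. The ordering (benchmark varies fastest, then repeat, then checkpoint) and the decode_task name are assumptions for illustration; eval/README.md describes the actual array logic.

```shell
# Hypothetical decoder for a flat task index.
# $1 = task index, $2 = number of repeats, $3 = number of benchmarks
decode_task() {
  idx="$1"; n_repeats="$2"; n_benches="$3"
  bench=$(( idx % n_benches ))
  repeat=$(( (idx / n_benches) % n_repeats ))
  ckpt=$(( idx / (n_benches * n_repeats) ))
  echo "$ckpt $repeat $bench"
}

# 5 checkpoints x 2 repeats x 4 benchmarks = 40 tasks, indexed 0..39
decode_task 13 2 4   # prints 1 1 1
decode_task 39 2 4   # prints 4 1 3
```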

Useful overrides:

HARNESS_BENCHES="mmlu,hellaswag,bbh,arc_challenge"
HARNESS_REPEATS=1
HARNESS_BATCH_SIZE=auto
HARNESS_BBH_LIMIT=50

Results are written to:

results/<model>/<method>/<model_index>/harness/<benchmark>_repeat<repeat>.json

6. Measure energy per token

EPT jobs evaluate base models, distilled adapters, and traditional SFT checkpoints with deterministic decoding:

EPT_REPEATS=5 bash job_title.sh repacss ept_student
EPT_REPEATS=5 bash job_title.sh repacss ept_teacher
EPT_REPEATS=5 bash job_title.sh repacss ept_feature
EPT_REPEATS=5 bash job_title.sh repacss ept_relation
EPT_REPEATS=5 bash job_title.sh repacss ept_response
EPT_REPEATS=5 bash job_title.sh repacss ept_trad_student

Use hpcc instead of repacss for A100 jobs. Outputs follow the same model/method hierarchy:

results/<model>/BASE/EPT/ept_base<array_task_id>.json
results/<student>/<method>/<model_index>/EPT/ept_repeat<repeat>.json
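
The headline metric is energy divided by generated tokens. As a rough illustration, the sketch below integrates timestamped power samples with the trapezoid rule and divides by a token count. The input format, the integration method, and the energy_per_token name are assumptions; EPT-Bench may compute energy differently, for example from NVML's cumulative energy counter.

```shell
# Hypothetical helper: joules per generated token from power samples.
# $1 = generated token count; stdin = "<seconds> <watts>" lines in time order.
energy_per_token() {
  awk -v tokens="$1" '
    NR > 1 { joules += (($2 + prev_w) / 2) * ($1 - prev_t) }  # trapezoid rule
    { prev_t = $1; prev_w = $2 }
    END { printf "%.3f\n", joules / tokens }
  '
}

# Three samples over 2 s integrate to 400 J; 800 tokens gives 0.500 J/token.
printf '0 100\n1 200\n2 300\n' | energy_per_token 800   # prints 0.500
```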

See eval/README.md for complete harness and EPT array logic, runtime overrides, output schemas, and log paths.

Slurm launchers

job_title.sh is the source of truth for job resources. It maps a target and job kind to partitions, GPUs, CPUs, memory, arrays, logs, and an execution script:

bash job_title.sh <repacss|hpcc> <job_kind>

The top-level launchers are intentionally thin:

  • submit_a100.sh submits one training or harness mode at a time.
  • submit_h100.sh currently submits harness modes. Its commented submission block documents the intended end-to-end dependency chain.

Cache, KD, traditional, and checkpoint-based EPT arrays currently assume five experimental runs. Adjust array sizes in job_title.sh when that count changes.
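
When changing the run count, the array range has to change with it. The sketch below builds a Slurm --array specification for N runs, with an optional % throttle on concurrent tasks. Zero-based indices and the array_spec name are assumptions; check how job_title.sh numbers its arrays before reusing this.

```shell
# Hypothetical helper: Slurm --array spec for N runs.
# $1 = number of runs, $2 = optional cap on simultaneously running tasks
array_spec() {
  runs="$1"
  max_parallel="${2:-}"
  spec="0-$(( runs - 1 ))"            # zero-based indices (assumed)
  if [ -n "$max_parallel" ]; then
    spec="${spec}%${max_parallel}"    # Slurm's concurrency throttle syntax
  fi
  echo "$spec"
}

array_spec 5     # prints 0-4
array_spec 5 2   # prints 0-4%2
```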

Outputs and reproducibility

data/<teacher>/...                        teacher cache shards
serialization_dir/<student>/<method>/...  distilled adapters
traditional_student/<student>/...         full SFT checkpoints
results/<student>/<method>/...            telemetry and evaluations
logs/...                                  Slurm and structured task logs

For a reproducible experiment, record the model IDs, dataset mix, feature-layer mapping, array size, training arguments, evaluation repeats, and benchmark limits. Scripts print these values into logs, while EPT JSON files include checkpoint and Slurm metadata.
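
One lightweight way to keep those settings together is a small manifest written next to the results. The sketch below is an assumption about how that could look, not something the repository provides: write_manifest and the key names are invented here, while the environment variables are the ones used elsewhere in this README plus Slurm's own SLURM_JOB_ID.

```shell
# Hypothetical helper: dump experiment settings as key=value lines.
# $1 = output path; unset variables are recorded as "unset".
write_manifest() {
  {
    printf 'teacher=%s\n' "${TEACHER:-unset}"
    printf 'student=%s\n' "${STUDENT:-unset}"
    printf 'hf_datasets=%s\n' "${HF_DATASETS:-unset}"
    printf 'weights=%s\n' "${WEIGHTS:-unset}"
    printf 'max_samples=%s\n' "${MAX_SAMPLES:-unset}"
    printf 'feature_layer_ratio=%s\n' "${FEATURE_LAYER_RATIO:-unset}"
    printf 'harness_repeats=%s\n' "${HARNESS_REPEATS:-unset}"
    printf 'harness_bbh_limit=%s\n' "${HARNESS_BBH_LIMIT:-unset}"
    printf 'ept_repeats=%s\n' "${EPT_REPEATS:-unset}"
    printf 'slurm_job_id=%s\n' "${SLURM_JOB_ID:-none}"
  } > "$1"
}
```

For example, write_manifest manifest.txt (a hypothetical file name) could be called from a submission wrapper after the exports are in place.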

Development notes

  • Run bash -n <script> to check shell syntax without executing anything before submitting expensive jobs.
  • Use a small MAX_SAMPLES, HARNESS_LIMIT, or EPT_NUM_PROMPTS for smoke tests.
  • Do not reuse a TEACHER_DATA directory with another teacher unless caches are rebuilt.
  • EPT requires NVML access on the compute node.
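
The syntax check from the first note can be applied to several launchers at once. The sketch below is one way to do that; check_scripts is a name chosen for this example, and bash -n only parses each file, so it will not catch missing variables or wrong paths.

```shell
# Hypothetical helper: syntax-check each script with bash -n, without running it.
# Returns non-zero if any script fails to parse.
check_scripts() {
  status=0
  for script in "$@"; do
    if bash -n "$script"; then
      echo "ok   $script"
    else
      echo "FAIL $script"
      status=1
    fi
  done
  return "$status"
}
```

A typical call would be check_scripts job_title.sh submit_a100.sh submit_h100.sh scripts/*.sh from the repository root.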

License

See LICENSE.
