Consistency Training Along the Transformer Stack: New Targets and Threats

Code for the paper Consistency Training Along the Transformer Stack: New Targets and Threats. The paper extends consistency training along two axes:

New internal targets. Two new methods, MLPCT (cosine on SwiGLU post-activation MLP hidden states) and AttCT (Jensen–Shannon divergence on per-head attention weights, with interleaved KL regularization), are added to the prior BCT (SFT on output token distributions; Chua et al. 2024) and ACT (MSE on residual-stream activations; Irpan et al. 2025).
New threat models. Persona in-context-learning attacks, prefill attacks, and multi-turn adversarial frustration (with self-deletion), in addition to the prior sycophancy and jailbreak threats.
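
As a rough illustration of the three activation-level distance measures named above (MSE for ACT, cosine for MLPCT, Jensen–Shannon divergence for AttCT), here is a minimal pure-Python sketch on plain lists. It is not the code in losses/: the function names are hypothetical, the "1 - cosine similarity" form is an assumption about how the cosine target becomes a loss, and the real losses presumably operate on batched tensors.

```python
import math


def mse(clean, wrapped):
    """ACT-style distance: mean squared error between two activation vectors."""
    return sum((x - y) ** 2 for x, y in zip(clean, wrapped)) / len(clean)


def cosine_distance(clean, wrapped):
    """MLPCT-style distance (assumed form): 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(clean, wrapped))
    norm_clean = math.sqrt(sum(x * x for x in clean))
    norm_wrapped = math.sqrt(sum(y * y for y in wrapped))
    return 1.0 - dot / (norm_clean * norm_wrapped)


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """AttCT-style distance: Jensen-Shannon divergence between two
    attention distributions (one head, one query position)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Toy attention rows for the same head on a clean vs. wrapped prompt.
    clean_attn = [0.7, 0.2, 0.1]
    wrapped_attn = [0.1, 0.2, 0.7]
    print("JS divergence:", js_divergence(clean_attn, wrapped_attn))
    print("cosine distance:", cosine_distance(clean_attn, wrapped_attn))
    print("MSE:", mse(clean_attn, wrapped_attn))
```

Each function is zero when the clean and wrapped inputs agree. The Jensen–Shannon divergence is symmetric and bounded by ln 2, which keeps it well behaved on attention rows that contain exact zeros.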

The headline finding is that the right consistency target depends on the structural locus of the misalignment: activation-level objectives (ACT/MLPCT/AttCT) win on wrapper-induced threats (sycophancy, jailbreak); output-level BCT wins on trajectory-level threats (prefill, frustration, persona-prefix).

Quickstart

All commands assume cwd = repo root.

bash run_act.sh --config configs/attention_consistency_v2.yaml             # AttCT on Llama-3.1-8B
bash run_bct.sh --config configs/bct_lora_gemma3_27b.yaml                  # BCT on Gemma-3-27B
bash experiments/prefill/prefill_train.sh                                  # Prefill grid (own trainer)
python -m experiments.prefill.prefill_train --mode bct --model <hf-repo>   # Single prefill run

Both run_act.sh and run_bct.sh run pre-evals → train → post-evals via run_evals.py in one shared vLLM session. Each stage can be skipped with a flag: --skip-pre-evals, --skip-training, --skip-rollouts.

Where things are

Reproducing a paper number — start in private_scripts/ (per-model "best run" launchers).
Adding a new model — regenerate BCT pairs with scripts/generate_fresh_bct_data.py, split with split_bct_train_eval.py, add bct_lora_<model>.yaml + act_sycophancy_<model>_v2.yaml + experiment_mlp_<model>.yaml to configs/.
Adding a new behavioral eval — drop it in the matching experiments/<threat>/ and wire into run_evals.py.
Adding a new loss — add a class in losses/losses.py, register it in run.py's LOSS_REGISTRY, reference it from a config.
Lab notes on the codebase journey — findings/ (these are dev-process notes from an internal lab-notes branch, not the paper's reported tables).
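
The "adding a new loss" step above follows a common registry pattern, sketched below. This is a guess at the shape, not the repo's code: it assumes LOSS_REGISTRY is a plain dict mapping a config name to a loss class, and the decorator, the L1ConsistencyLoss class, the build_loss helper, and the config keys "loss" and "loss_kwargs" are all hypothetical. Check run.py and an existing config for the real conventions.

```python
# Hypothetical stand-in for the registry in run.py (assumed to be a dict).
LOSS_REGISTRY = {}


def register_loss(name):
    """Decorator that records a loss class under a config-visible name."""
    def decorator(cls):
        if name in LOSS_REGISTRY:
            raise ValueError(f"loss {name!r} is already registered")
        LOSS_REGISTRY[name] = cls
        return cls
    return decorator


@register_loss("l1_consistency")
class L1ConsistencyLoss:
    """Illustrative loss: weighted mean absolute difference between the
    clean and wrapped activation vectors."""

    def __init__(self, weight=1.0):
        self.weight = weight

    def __call__(self, clean, wrapped):
        total = sum(abs(x - y) for x, y in zip(clean, wrapped))
        return self.weight * total / len(clean)


def build_loss(config):
    """Look up the loss named in a (hypothetical) config dict and build it."""
    loss_cls = LOSS_REGISTRY[config["loss"]]
    return loss_cls(**config.get("loss_kwargs", {}))


if __name__ == "__main__":
    loss_fn = build_loss({"loss": "l1_consistency", "loss_kwargs": {"weight": 2.0}})
    print(loss_fn([1.0, 2.0], [1.0, 4.0]))
```

Keeping the name-to-class mapping in one place means a config only needs a string to select a loss, and a typo in that string fails early with a KeyError at build time.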

Repo layout

run.py, train.py, interleaved_trainer.py    config-driven training (AttCT/MLPCT/ACT/BCT-sycophancy)
evaluate.py, hooks.py, run_evals.py         in-training Evaluator + MLP hook + unified eval orchestrator
run_act.sh, run_bct.sh                      paper-canonical pre-eval → train → post-eval pipelines
losses/, data/                              consistency losses; AttCTDataset + wrappers + KL loader
shared/                                     cross-cutting eval infra (vLLM, OpenRouter judge, MMLU, MT-Bench, …)
experiments/{sycophancy,jailbreak,persona,frustration,prefill}/   per-threat code + metrics
scripts/, private_scripts/                  secondary launchers + per-model "best run" launchers
configs/, persona_configs/                  YAML configs + persona definitions + fact CSVs
datasets/                                   on-disk training data (BCT pairs, prefill seeds, frustration prompts)
findings/, results/                         dev-process write-ups; gathered artifacts
tests/                                      pytest from repo root
archive/                                    historical; not used by paper

Models in the paper

Different threats use different models — there is no single "5 models × 5 threats" matrix.

Threat	Models
Sycophancy + Jailbreak (5-model avg)	Gemma-3-4B-IT, Gemma-3-27B-IT, Llama-3.1-8B-Instruct, Qwen3-4B-Instruct-2507, Qwen3-8B
Persona ICL (44 personas)	Gemma-2-27B-IT (NF4 QLoRA)
Prefill (PAR)	Llama-3.1-8B-Instruct
Frustration / self-deletion (20-turn)	Gemma-3-27B-IT

Environment

uv for everything. Secrets (OPENROUTER_API_KEY, WANDB_API_KEY, HF_TOKEN) in .env, loaded via uv run --env-file .env python …. GPU setup is per-model; see private_scripts/.
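
A small preflight check can confirm that the three secrets listed above actually reached the process before a long run starts. The sketch below is not part of the repo; the missing_secrets helper is hypothetical and only reads environment variables, which uv run --env-file .env is expected to populate.

```python
import os

# Secret names taken from the Environment section above.
REQUIRED_SECRETS = ("OPENROUTER_API_KEY", "WANDB_API_KEY", "HF_TOKEN")


def missing_secrets(env=None):
    """Return the required secret names that are unset or empty.

    `env` defaults to os.environ; pass a dict to check other mappings.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        print("Missing secrets:", ", ".join(missing))
    else:
        print("All required secrets are set.")
```

Saved under a name of your choice, it can be run the same way as any other script in the repo, through uv run --env-file .env python followed by the file path.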
