feat(safety-lora): Sprint 17 - safety-subspace LoRA (SaLoRA / SPLoRA) (v1.7.0) #9
Merged
Prevents LoRA fine-tuning from silently erasing safety alignment by combining three
complementary techniques, all validated on 1B-3B models (toki's target range).
- toki.safety_lora (new module):
  SafetyLoRAConfig: four safety fields with defaults (all disabled)
  SploraAuditResult: frozen dataclass with flagged_layers, max_ediem, passed, threshold
  LoRATrainResult: wraps training_loss + num_steps + an optional SploraAuditResult
  load_safety_subspace(path): loads a safety delta .pt checkpoint (SaLoRA, arXiv 2501.01765)
  freeze_safety_adapter(model, delta): applies and freezes the safety params; no-op when delta is None
  _ediem(base, ft): normalised Frobenius distance used to approximate E-DIEM
  splora_audit(model, base_state, threshold): post-hoc E-DIEM audit (SPLoRA, arXiv 2506.18931)
  All torch operations sit behind try-import guards; missing deps raise ImportError("toki[hf]") cleanly
- toki.finetune (extended):
  LoRAConfig: four new safety fields (safety_lora_rank, safety_subspace_path,
    enable_splora_audit, splora_threshold), all defaulting to disabled for backward compatibility
  LoRAFinetuner.train(): returns LoRATrainResult; hooks load and freeze the safety adapter before
    training and run the E-DIEM audit after training, attaching the audit to the result when enabled
  config_summary(): includes the safety fields
- CLI: python -m toki finetune --safety-lora-rank --safety-subspace --splora-audit
- pyproject.toml: version 1.7.0
- 44 new tests (24 safety_lora + 15 finetune_extended + 5 CLI)
- 644 total tests passing (600 → 644)
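The post-hoc E-DIEM audit amounts to comparing each layer's fine-tuned weights with its base weights and flagging layers whose drift exceeds a threshold. Below is a minimal pure-Python sketch of that idea, not toki's implementation and not the SPLoRA paper's exact metric. It assumes the normalisation is ||ft - base||_F / ||base||_F, that a layer is flagged when its score is strictly above the threshold, and that the audit passes only when nothing is flagged. `ediem`, `audit_state_dicts` and `EdiemAuditResult` are hypothetical names (the result type only mirrors the field names listed for SploraAuditResult), and nested lists stand in for torch tensors so the example runs without toki[hf].

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def frobenius(m: Matrix) -> float:
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def ediem(base: Matrix, ft: Matrix, eps: float = 1e-12) -> float:
    """Assumed E-DIEM proxy: ||ft - base||_F / ||base||_F (eps avoids division by zero)."""
    diff = [[f - b for f, b in zip(f_row, b_row)] for f_row, b_row in zip(ft, base)]
    return frobenius(diff) / max(frobenius(base), eps)


@dataclass(frozen=True)
class EdiemAuditResult:
    """Hypothetical result type mirroring SploraAuditResult's listed fields."""
    flagged_layers: Tuple[str, ...]
    max_ediem: float
    passed: bool
    threshold: float


def audit_state_dicts(
    base_state: Dict[str, Matrix], ft_state: Dict[str, Matrix], threshold: float
) -> EdiemAuditResult:
    """Score every layer present in both state dicts; flag scores above threshold."""
    scores = {
        name: ediem(base_state[name], ft_state[name])
        for name in base_state
        if name in ft_state
    }
    flagged = tuple(sorted(name for name, score in scores.items() if score > threshold))
    return EdiemAuditResult(
        flagged_layers=flagged,
        max_ediem=max(scores.values(), default=0.0),
        passed=not flagged,  # assumed rule: pass only if no layer is flagged
        threshold=threshold,
    )


base = {"q_proj": [[1.0, 0.0], [0.0, 1.0]], "v_proj": [[2.0, 0.0], [0.0, 2.0]]}
tuned = {"q_proj": [[1.0, 0.0], [0.0, 1.0]], "v_proj": [[2.0, 1.0], [0.0, 2.0]]}
result = audit_state_dicts(base, tuned, threshold=0.1)
print(result.flagged_layers, result.passed)  # ('v_proj',) False
```

Here v_proj drifts by a Frobenius distance of 1 against a base norm of sqrt(8), a score of about 0.354, so it is flagged at a threshold of 0.1 while the unchanged q_proj scores 0.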
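Backward compatibility rests on the four new LoRAConfig safety fields defaulting to "off", so an untouched config takes the v1.6.0 code path. The sketch below shows one way to express that with a dataclass. Only the field names come from this PR; `SafetyFieldsSketch`, `safety_enabled()`, the sentinel value 0 for `safety_lora_rank`, and the 0.1 default for `splora_threshold` are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SafetyFieldsSketch:
    """Hypothetical stand-in for the four safety fields added to LoRAConfig.

    Field names come from the PR; every default value is an assumption chosen
    so that all safety features stay off unless explicitly requested.
    """

    safety_lora_rank: int = 0                   # assumed: 0 means no safety adapter
    safety_subspace_path: Optional[str] = None  # no SaLoRA delta checkpoint to load
    enable_splora_audit: bool = False           # skip the post-hoc E-DIEM audit
    splora_threshold: float = 0.1               # assumed value; only read when auditing

    def safety_enabled(self) -> bool:
        """True if any safety hook would run during training."""
        return (
            self.safety_lora_rank > 0
            or self.safety_subspace_path is not None
            or self.enable_splora_audit
        )

    def config_summary(self) -> str:
        """One 'field: value' line per field, safety fields included."""
        return "\n".join(f"{name}: {value}" for name, value in asdict(self).items())


defaults = SafetyFieldsSketch()
print(defaults.safety_enabled())  # False
print(defaults.config_summary())
```

Keeping every default falsy means a single predicate decides whether the load/freeze and audit hooks run, which is what lets a test such as test_train_no_safety_fields_no_audit check the no-op path.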
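The try-import guards keep torch optional: the module imports fine on a base install and only fails, with an actionable message, when a torch-backed function is called. A minimal sketch of that pattern, assuming the guard is a small helper around `importlib.import_module`; `require_optional` and `load_delta_sketch` are hypothetical names, and only the error message is quoted from the PR description.

```python
import importlib
from types import ModuleType

# Error message quoted from the PR description.
HF_EXTRA_HINT = "requires toki[hf]: pip install toki[hf]"


def require_optional(module_name: str) -> ModuleType:
    """Import an optional dependency, or raise ImportError carrying the install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:  # also catches ModuleNotFoundError
        raise ImportError(HF_EXTRA_HINT) from exc


def load_delta_sketch(path: str):
    """Hypothetical loader: torch is imported only when this function is called."""
    torch = require_optional("torch")
    return torch.load(path, map_location="cpu")
```

Because the import happens inside the function body, merely importing the module never touches torch; chaining with `from exc` keeps the original ModuleNotFoundError visible in the traceback.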
https://claude.ai/code/session_01XCHiLCiVeL6WXQdsAcQTbx
Summary
- toki.safety_lora (new module): three complementary safety-preserving LoRA techniques from the 2025-2026 literature, all validated on 1B-3B models (toki's target range)
- toki.finetune (extended): LoRAConfig gains four new safety fields; LoRAFinetuner.train() returns LoRATrainResult and wires in the safety hooks
- python -m toki finetune --safety-lora-rank --safety-subspace --splora-audit

Research basis
- load_safety_subspace + freeze_safety_adapter
- splora_audit + SploraAuditResult
- safety_lora_rank field in LoRAConfig

New symbols
- SafetyLoRAConfig
- SploraAuditResult
- LoRATrainResult
- load_safety_subspace(path)
- freeze_safety_adapter(model, delta)
- splora_audit(model, base_state, threshold)

All torch/peft operations are behind try-import guards and raise ImportError("requires toki[hf]: pip install toki[hf]") cleanly when the dependencies are absent.

Backward compatibility
A LoRAConfig with no safety fields set (all defaults) produces training behaviour identical to v1.6.0, confirmed by test_finetune_extended.py::test_train_no_safety_fields_no_audit.

Test plan
- test_safety_lora.py: 24 tests covering SafetyLoRAConfig, SploraAuditResult, LoRATrainResult, the load/freeze/audit import guards, no-op paths, and mock-tensor integration
- test_finetune_extended.py: 15 tests covering the new LoRAConfig fields, backward compatibility, LoRAFinetuner construction, and train() with mocked torch
- test_main.py additions: 5 CLI tests covering finetune config printing, flag reflection, and the import guard
- cargo test green, cargo clippy -- -D warnings clean

Unblocks
Generated by Claude Code