feat(safety-lora): Sprint 17 — safety-subspace LoRA (SaLoRA / SPLoRA) (v1.7.0) by konjoinfinity · Pull Request #9 · konjoai/toki

konjoinfinity · 2026-06-14T14:27:40Z

Summary

toki.safety_lora (new module) — three complementary safety-preserving LoRA techniques from the 2025-2026 literature, all validated on 1B-3B models (toki's target range)
toki.finetune (extended) — LoRAConfig gains four new safety fields; LoRAFinetuner.train() returns LoRATrainResult and wires in the safety hooks
CLI — python -m toki finetune --safety-lora-rank --safety-subspace --splora-audit
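As a rough illustration of how the three CLI flags could map onto the four new config fields, here is a minimal sketch. `LoRAConfigSketch`, `build_parser`, and `config_from_args` are hypothetical stand-ins, not toki's code; the flag types, the "disabled" sentinel values, and the `splora_threshold` default are assumptions, since the PR only names the flags and fields.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoRAConfigSketch:
    """Hypothetical stand-in for the safety fields added to toki's LoRAConfig."""
    safety_lora_rank: int = 0                   # assumed: 0 means disabled
    safety_subspace_path: Optional[str] = None  # assumed: None means no SaLoRA delta
    enable_splora_audit: bool = False
    splora_threshold: float = 0.1               # placeholder, not toki's real default


def build_parser() -> argparse.ArgumentParser:
    # Flag names come from the PR; their types and defaults are assumptions.
    parser = argparse.ArgumentParser(prog="toki-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    ft = sub.add_parser("finetune")
    ft.add_argument("--safety-lora-rank", type=int, default=0)
    ft.add_argument("--safety-subspace", default=None)
    ft.add_argument("--splora-audit", action="store_true")
    return parser


def config_from_args(argv: List[str]) -> LoRAConfigSketch:
    """Parse CLI arguments and copy the safety flags into the config."""
    args = build_parser().parse_args(argv)
    return LoRAConfigSketch(
        safety_lora_rank=args.safety_lora_rank,
        safety_subspace_path=args.safety_subspace,
        enable_splora_audit=args.splora_audit,
    )
```

With no flags given, the sketch yields the all-defaults config, which is the case the backward-compatibility note relies on.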

Research basis

| Paper | arXiv | Technique | Implementation |
| --- | --- | --- | --- |
| SaLoRA | 2501.01765 | Frozen safety delta before task fine-tuning | `load_safety_subspace` + `freeze_safety_adapter` |
| SPLoRA (TACL) | 2506.18931 | Post-hoc E-DIEM audit of weight-update shifts | `splora_audit` + `SploraAuditResult` |
| Rank-1 MLP alignment | 2507.17075 | Rank-1 up_proj safety adapter; zero reasoning tax | `safety_lora_rank` field in `LoRAConfig` |
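To make the "rank-1" row concrete, the sketch below shows the general shape of a rank-1 weight update (an outer product of two vectors) and why it is cheap to store. It uses plain Python lists and generic linear algebra only; the function names and the layer dimensions in the usage note are illustrative, and how the paper or toki scales and initialises the adapter is not stated in this PR.

```python
def rank1_delta(b, a):
    """Outer product b a^T: a len(b) x len(a) matrix of rank at most 1."""
    return [[bi * aj for aj in a] for bi in b]


def apply_delta(weight, delta):
    """Return weight + delta element-wise without mutating either input."""
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def trainable_params(d_out, d_in, rank=1):
    """A rank-r adapter stores r * (d_out + d_in) numbers instead of d_out * d_in."""
    return rank * (d_out + d_in)
```

For a hypothetical projection with `d_out=8192` and `d_in=2048`, `trainable_params(8192, 2048)` counts 10,240 adapter values, compared with the 16,777,216 entries of the full matrix.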

New symbols

| Symbol | Module | Description |
| --- | --- | --- |
| `SafetyLoRAConfig` | safety_lora | Four safety fields with safe defaults |
| `SploraAuditResult` | safety_lora | Frozen dataclass: flagged_layers, max_ediem, passed, threshold |
| `LoRATrainResult` | safety_lora | training_loss + num_steps + optional SploraAuditResult |
| `load_safety_subspace(path)` | safety_lora | Load safety delta checkpoint (SaLoRA) |
| `freeze_safety_adapter(model, delta)` | safety_lora | Apply + freeze safety params; no-op when delta is None |
| `splora_audit(model, base_state, threshold)` | safety_lora | E-DIEM post-hoc safety audit (SPLoRA) |
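The audit result and the per-layer score can be pictured with a small dependency-free sketch. The commit message describes the score as a normalised Frobenius distance used to approximate E-DIEM; the exact normalisation is not given, so dividing by the base weight's norm is an assumption here. `AuditResultSketch`, `ediem_sketch`, and `audit_sketch` are hypothetical names, and weights are plain nested lists rather than tensors.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Matrix = List[List[float]]


@dataclass(frozen=True)
class AuditResultSketch:
    """Mirrors the four fields listed for SploraAuditResult."""
    flagged_layers: Tuple[str, ...]
    max_ediem: float
    passed: bool
    threshold: float


def frobenius(m: Matrix) -> float:
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def ediem_sketch(base: Matrix, ft: Matrix) -> float:
    """Assumed normalisation: ||ft - base||_F / ||base||_F."""
    diff = [[f - b for f, b in zip(f_row, b_row)] for f_row, b_row in zip(ft, base)]
    num, denom = frobenius(diff), frobenius(base)
    if denom == 0.0:
        return 0.0 if num == 0.0 else math.inf
    return num / denom


def audit_sketch(ft_state: Dict[str, Matrix], base_state: Dict[str, Matrix],
                 threshold: float) -> AuditResultSketch:
    """Score every layer present in both states and flag those above threshold."""
    scores = {name: ediem_sketch(base_state[name], w)
              for name, w in ft_state.items() if name in base_state}
    flagged = tuple(sorted(n for n, s in scores.items() if s > threshold))
    return AuditResultSketch(flagged, max(scores.values(), default=0.0),
                             not flagged, threshold)
```

In this sketch an audit passes only when no layer exceeds the threshold, so `passed` and `flagged_layers` always agree; whether toki defines `passed` the same way is not confirmed by the PR text.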

All torch/peft operations sit behind try-import guards, which raise ImportError("requires toki[hf]: pip install toki[hf]") cleanly when the dependencies are absent.
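A try-import guard of this kind is commonly written as a small lazy-import helper. The sketch below shows one way to do it; `require_module` and `HF_HINT` are hypothetical names, and only the error message text is taken from this PR.

```python
import importlib

# Message text as quoted in the PR description.
HF_HINT = "requires toki[hf]: pip install toki[hf]"


def require_module(name: str):
    """Import `name` lazily, turning a missing dependency into an actionable hint."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible.
        raise ImportError(HF_HINT) from exc
```

Calling `require_module("torch")` inside a function body, rather than at module import time, would keep the package importable without the optional extras installed.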

Backward compatibility

LoRAConfig with no safety fields set (all defaults) produces identical training behaviour to v1.6.0 — confirmed by test_finetune_extended.py::test_train_no_safety_fields_no_audit.
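The hook ordering described in the commit message (load and freeze before training, audit after training, audit attached only when enabled) can be sketched as follows. This is an illustration of the control flow under stated assumptions, not `LoRAFinetuner.train()` itself: `train_sketch`, `TrainResultSketch`, the `audit` field name, and the injected callables are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class TrainResultSketch:
    training_loss: float
    num_steps: int
    audit: Optional[Any] = None  # field name is an assumption


def train_sketch(
    run_training: Callable[[], Tuple[float, int]],
    safety_subspace_path: Optional[str] = None,
    enable_splora_audit: bool = False,
    load_and_freeze: Callable[[str], None] = lambda path: None,
    run_audit: Callable[[], Any] = lambda: None,
) -> TrainResultSketch:
    """Run the optional safety hooks around an injected training function."""
    # Pre-training hook: skipped entirely when no safety delta is configured.
    if safety_subspace_path is not None:
        load_and_freeze(safety_subspace_path)
    loss, steps = run_training()
    # Post-training hook: the audit runs only when explicitly enabled.
    audit = run_audit() if enable_splora_audit else None
    return TrainResultSketch(loss, steps, audit)
```

With every safety argument left at its default, neither hook runs and the result carries no audit, which is the behaviour the backward-compatibility test name suggests.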

Test plan

test_safety_lora.py — 24 tests: SafetyLoRAConfig, SploraAuditResult, LoRATrainResult, load/freeze/audit import guards, no-op paths, mock-tensor integration
test_finetune_extended.py — 15 tests: LoRAConfig new fields, backward compat, LoRAFinetuner construction, train() with mocked torch
test_main.py additions — 5 CLI tests: finetune config print, flag reflection, import guard
644 / 644 tests passing (600 → 644)
cargo test green, cargo clippy -- -D warnings clean

Unblocks

P3-2 (compliance certification report) — can now report on safety-preservation during fine-tuning
P3-1 (dual-agent red-team loop) — full pipeline now includes both the evaluator reliability fix (Sprint 16) and the safety-subspace fine-tuning protection (Sprint 17)

https://claude.ai/code/session_01XCHiLCiVeL6WXQdsAcQTbx

Generated by Claude Code

… (v1.7.0)

Prevents LoRA fine-tuning from silently erasing safety alignment across three complementary techniques validated on 1B-3B models (toki's target range).

- toki.safety_lora (new module):
  - SafetyLoRAConfig: four safety fields with defaults (all disabled)
  - SploraAuditResult: frozen dataclass: flagged_layers, max_ediem, passed, threshold
  - LoRATrainResult: wraps training_loss + num_steps + optional SploraAuditResult
  - load_safety_subspace(path): load safety delta .pt checkpoint (SaLoRA, arXiv 2501.01765)
  - freeze_safety_adapter(model, delta): apply + freeze safety params; no-op when None
  - _ediem(base, ft): normalised Frobenius distance for E-DIEM approximation
  - splora_audit(model, base_state, threshold): post-hoc E-DIEM audit (SPLoRA, arXiv 2506.18931)
  - All torch operations behind try-import guards; raises ImportError("toki[hf]") cleanly
- toki.finetune (extended):
  - LoRAConfig: four new safety fields (safety_lora_rank, safety_subspace_path, enable_splora_audit, splora_threshold); all default to disabled (backward compat)
  - LoRAFinetuner.train(): returns LoRATrainResult; hooks: load+freeze before training, E-DIEM audit after training; audit attached to result when enabled
  - config_summary(): includes safety fields
- CLI: python -m toki finetune --safety-lora-rank --safety-subspace --splora-audit
- pyproject.toml: version 1.7.0
- 44 new tests (24 safety_lora + 15 finetune_extended + 5 CLI)
- 644 total tests passing (600 → 644)

https://claude.ai/code/session_01XCHiLCiVeL6WXQdsAcQTbx

wesleyscholl marked this pull request as ready for review June 14, 2026 14:57

wesleyscholl merged commit 3a5e474 into main Jun 14, 2026
7 checks passed

wesleyscholl deleted the claude/konjo-toki-xPy83 branch June 14, 2026 14:58

wesleyscholl merged 1 commit into main from claude/konjo-toki-xPy83
