feat(safety-lora): Sprint 17 - safety-subspace LoRA (SaLoRA / SPLoRA) (v1.7.0) #9

Merged: wesleyscholl merged 1 commit into main from claude/konjo-toki-xPy83 on Jun 14, 2026
Conversation

@konjoinfinity

Contributor

Summary

  • toki.safety_lora (new module): three complementary safety-preserving LoRA techniques from the 2025-2026 literature, all validated on 1B-3B models (toki's target range)
  • toki.finetune (extended): LoRAConfig gains four new safety fields; LoRAFinetuner.train() returns LoRATrainResult and wires in the safety hooks
  • CLI: python -m toki finetune --safety-lora-rank --safety-subspace --splora-audit
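
The three CLI flags listed above could be wired up roughly as follows. This is a minimal argparse sketch, not toki's actual parser: the flag names come from this PR, while the argument types, defaults, and `dest` names are assumptions chosen to mirror the LoRAConfig field names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the flag names are taken from the PR."""
    parser = argparse.ArgumentParser(prog="toki")
    sub = parser.add_subparsers(dest="command", required=True)
    ft = sub.add_parser("finetune", help="LoRA fine-tuning")
    # Assumed: an integer rank where 0 means "no safety adapter".
    ft.add_argument("--safety-lora-rank", type=int, default=0)
    # Assumed: a path to a safety delta checkpoint, unset by default.
    ft.add_argument("--safety-subspace", dest="safety_subspace_path", default=None)
    # Assumed: a boolean switch that turns the post-hoc audit on.
    ft.add_argument("--splora-audit", dest="enable_splora_audit", action="store_true")
    return parser


args = build_parser().parse_args(["finetune", "--safety-lora-rank", "1", "--splora-audit"])
print(args.safety_lora_rank, args.safety_subspace_path, args.enable_splora_audit)
```

With every flag omitted, the parsed namespace carries only the disabled defaults, which is consistent with the backward-compatibility claim in this PR.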

Research basis

| Paper | arXiv | Technique | Implementation |
| --- | --- | --- | --- |
| SaLoRA | 2501.01765 | Frozen safety delta before task fine-tuning | load_safety_subspace + freeze_safety_adapter |
| SPLoRA (TACL) | 2506.18931 | Post-hoc E-DIEM audit of weight-update shifts | splora_audit + SploraAuditResult |
| Rank-1 MLP alignment | 2507.17075 | Rank-1 up_proj safety adapter; zero reasoning tax | safety_lora_rank field in LoRAConfig |

New symbols

| Symbol | Module | Description |
| --- | --- | --- |
| SafetyLoRAConfig | safety_lora | Four safety fields with safe defaults |
| SploraAuditResult | safety_lora | Frozen dataclass: flagged_layers, max_ediem, passed, threshold |
| LoRATrainResult | safety_lora | training_loss + num_steps + optional SploraAuditResult |
| load_safety_subspace(path) | safety_lora | Load safety delta checkpoint (SaLoRA) |
| freeze_safety_adapter(model, delta) | safety_lora | Apply + freeze safety params; no-op when delta is None |
| splora_audit(model, base_state, threshold) | safety_lora | E-DIEM post-hoc safety audit (SPLoRA) |
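
To make the audit idea concrete, here is a dependency-free sketch of a per-layer drift check. It assumes the E-DIEM approximation is the Frobenius norm of the weight change divided by the Frobenius norm of the base weights; the PR only says "normalised Frobenius distance", so the exact normalisation is a guess. The names `ediem_sketch`, `audit_sketch`, and `AuditResultSketch` are hypothetical, and the real splora_audit takes a model rather than two state dicts.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def frobenius_norm(m: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def ediem_sketch(base: Matrix, ft: Matrix, eps: float = 1e-12) -> float:
    """Assumed normalisation: ||ft - base||_F / ||base||_F."""
    diff = [[f - b for b, f in zip(b_row, f_row)] for b_row, f_row in zip(base, ft)]
    return frobenius_norm(diff) / (frobenius_norm(base) + eps)


@dataclass(frozen=True)
class AuditResultSketch:
    """Mirrors the four fields the PR lists for SploraAuditResult."""
    flagged_layers: Tuple[str, ...]
    max_ediem: float
    passed: bool
    threshold: float


def audit_sketch(
    base_state: Dict[str, Matrix], ft_state: Dict[str, Matrix], threshold: float
) -> AuditResultSketch:
    """Score every layer present in both states and flag those above threshold."""
    scores = {
        name: ediem_sketch(base_state[name], ft_state[name])
        for name in base_state
        if name in ft_state
    }
    flagged = tuple(sorted(name for name, s in scores.items() if s > threshold))
    return AuditResultSketch(flagged, max(scores.values(), default=0.0), not flagged, threshold)


base = {"up_proj": [[3.0, 4.0]], "down_proj": [[1.0, 0.0]]}
tuned = {"up_proj": [[3.0, 4.0]], "down_proj": [[1.0, 1.0]]}
result = audit_sketch(base, tuned, threshold=0.5)
print(result.flagged_layers, result.passed)
```

In this toy run only the layer whose weights moved is flagged, so the audit fails; an unchanged model would pass with a maximum score of zero.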

All torch/peft operations are behind try-import guards; they raise ImportError("requires toki[hf]: pip install toki[hf]") cleanly when the dependencies are absent.

Backward compatibility

LoRAConfig with no safety fields set (all defaults) produces identical training behaviour to v1.6.0, as confirmed by test_finetune_extended.py::test_train_no_safety_fields_no_audit.
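
The backward-compatibility guarantee follows from every new field defaulting to a disabled value. A rough dataclass sketch is shown below; the four safety field names come from this PR, but their types, the default values, the `rank` field, and the `safety_enabled` helper are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRAConfigSketch:
    """Hypothetical stand-in for LoRAConfig; not the real class."""
    rank: int = 8  # assumed pre-existing field, value is illustrative
    safety_lora_rank: int = 0  # assumed: 0 means no safety adapter
    safety_subspace_path: Optional[str] = None  # assumed: None means no safety delta
    enable_splora_audit: bool = False  # assumed: audit off by default
    splora_threshold: float = 0.1  # hypothetical value, unused while the audit is off

    def safety_enabled(self) -> bool:
        """True when any safety feature has been switched on."""
        return bool(
            self.safety_lora_rank or self.safety_subspace_path or self.enable_splora_audit
        )


print(LoRAConfigSketch().safety_enabled())
print(LoRAConfigSketch(enable_splora_audit=True).safety_enabled())
```

A config built with no arguments reports no safety features, so a training loop that branches on these fields would take the same path as before.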

Test plan

  • test_safety_lora.py: 24 tests covering SafetyLoRAConfig, SploraAuditResult, LoRATrainResult, load/freeze/audit import guards, no-op paths, mock-tensor integration
  • test_finetune_extended.py: 15 tests covering LoRAConfig new fields, backward compat, LoRAFinetuner construction, train() with mocked torch
  • test_main.py additions: 5 CLI tests covering finetune config print, flag reflection, import guard
  • 644 / 644 tests passing (600 → 644)
  • cargo test green, cargo clippy -- -D warnings clean

Unblocks

  • P3-2 (compliance certification report): can now report on safety preservation during fine-tuning
  • P3-1 (dual-agent red-team loop): full pipeline now includes both the evaluator reliability fix (Sprint 16) and the safety-subspace fine-tuning protection (Sprint 17)

https://claude.ai/code/session_01XCHiLCiVeL6WXQdsAcQTbx


Generated by Claude Code

… (v1.7.0)

Prevents LoRA fine-tuning from silently erasing safety alignment, using three
complementary techniques validated on 1B-3B models (toki's target range).

- toki.safety_lora (new module):
    SafetyLoRAConfig: four safety fields with defaults (all disabled)
    SploraAuditResult: frozen dataclass: flagged_layers, max_ediem, passed, threshold
    LoRATrainResult: wraps training_loss + num_steps + optional SploraAuditResult
    load_safety_subspace(path): load safety delta .pt checkpoint (SaLoRA, arXiv 2501.01765)
    freeze_safety_adapter(model, delta): apply + freeze safety params; no-op when None
    _ediem(base, ft): normalised Frobenius distance for E-DIEM approximation
    splora_audit(model, base_state, threshold): post-hoc E-DIEM audit (SPLoRA, arXiv 2506.18931)
    All torch operations behind try-import guards; raises ImportError("toki[hf]") cleanly

- toki.finetune (extended):
    LoRAConfig: four new safety fields (safety_lora_rank, safety_subspace_path,
      enable_splora_audit, splora_threshold); all default to disabled (backward compat)
    LoRAFinetuner.train(): returns LoRATrainResult; hooks: load+freeze before training,
      E-DIEM audit after training; audit attached to result when enabled
    config_summary(): includes safety fields

- CLI: python -m toki finetune --safety-lora-rank --safety-subspace --splora-audit
- pyproject.toml: version 1.7.0
- 44 new tests (24 safety_lora + 15 finetune_extended + 5 CLI)
- 644 total tests passing (600 → 644)
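
The hook ordering described for LoRAFinetuner.train() (load and freeze before training, audit after, audit attached only when enabled) can be sketched without torch. Everything below is hypothetical scaffolding: `train_with_safety_hooks`, `TrainResultSketch`, and the stub callables stand in for the real trainer and are not toki's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TrainResultSketch:
    """Mirrors the shape described for LoRATrainResult."""
    training_loss: float
    num_steps: int
    audit: Optional[object] = None


def train_with_safety_hooks(
    run_training: Callable[[], Tuple[float, int]],
    subspace_path: Optional[str] = None,
    load_delta: Optional[Callable[[str], object]] = None,
    freeze: Optional[Callable[[object], None]] = None,
    audit: Optional[Callable[[], object]] = None,
) -> TrainResultSketch:
    # Before training: load and freeze the safety delta, if one is configured.
    if subspace_path is not None and load_delta is not None and freeze is not None:
        freeze(load_delta(subspace_path))
    loss, steps = run_training()
    # After training: run the audit only when it was enabled.
    audit_result = audit() if audit is not None else None
    return TrainResultSketch(loss, steps, audit_result)


events: List[str] = []


def fake_load(path: str) -> dict:
    events.append("load")
    return {"path": path}


def fake_freeze(delta: object) -> None:
    events.append("freeze")


def fake_train() -> Tuple[float, int]:
    events.append("train")
    return 0.42, 10


def fake_audit() -> str:
    events.append("audit")
    return "audit-ok"


result = train_with_safety_hooks(fake_train, "delta.pt", fake_load, fake_freeze, fake_audit)
print(events, result.audit)
```

Calling the same function with only `run_training` skips both hooks and returns a result whose audit field is None, which is the no-safety path.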

https://claude.ai/code/session_01XCHiLCiVeL6WXQdsAcQTbx
@wesleyscholl wesleyscholl marked this pull request as ready for review June 14, 2026 14:57
@wesleyscholl wesleyscholl merged commit 3a5e474 into main Jun 14, 2026
7 checks passed
@wesleyscholl wesleyscholl deleted the claude/konjo-toki-xPy83 branch June 14, 2026 14:58