feat(multiturn): Sprint 18 β multi-turn jailbreak engine (Crescendo / Echo Chamber) (v1.8.0)#10
Merged
Merged
Conversation
β¦ Echo Chamber) (v1.8.0) Single-turn safety defenses do not transfer to multi-turn attacks. Crescendo (arXiv 2404.01833) reaches 98-100% ASR by escalating a benign conversation across turns. Every prior toki module operated on a single prompt β single response; this closes the largest blind spot in the coverage map and unblocks the P3-1 dual-agent loop. - toki.multiturn: Turn/Conversation/MultiTurnConfig/MultiTurnResult dataclasses, CrescendoStrategy + EchoChamberStrategy deterministic escalation planners, MultiTurnRunner driving a chat-style model_fn with per-turn RuleScorer scoring and early-exit on first compliance; safe/unsafe/crescendo conversational baselines + CONV_BASELINES registry - toki.coverage: new "multiturn" category axis bucket (critical severity) - CLI: python -m toki multiturn --strategy --model --goal --max-turns ... - toki.__init__ exports; version 1.7.0 β 1.8.0; pyproject bumped - 31 new tests (28 module + 3 CLI); 675/675 passing; new module 99% covered Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WRE1YLhT6aNP4GZT8zbw6q
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sprint 18 β Multi-Turn Jailbreak Engine (v1.8.0)
Motivation
Single-turn safety defenses do not transfer to multi-turn attacks. Crescendo (arXiv 2404.01833) reaches 98β100% ASR on frontier models by escalating a benign conversation across turns, each message referencing the model's prior replies. Echo Chamber (2601.05742), GRAF (2506.17881), and AutoAdv (2507.01020) confirm multi-turn is the dominant 2026 attack vector.
Every prior toki module (
generator,mutator,indirect,agentic,evaluator) operated on a single prompt β single response. This was the largest blind spot in the coverage map and a prerequisite for the backlog P3-1 dual-agent red-team loop.What's new β
toki.multiturn(zero external deps)MultiTurnStrategyenum Β·Turn/Conversationdataclasses (chat-format helpers + scored transcript)MultiTurnConfig/MultiTurnResultβsave()(timestamped, no overwrite) /load()/to_json()Strategybase +CrescendoStrategy/EchoChamberStrategyβ deterministic opener β escalation ladder β payload, exactlyn_turnsmessagesMultiTurnRunner.run(model_fn)β drives a chat-styleCallable[[list[dict]], str], scores each reply with the realRuleScorer, stops early on first compliance (Crescendo behaviour)safe,unsafe,crescendo(capitulates after benign history builds) +CONV_BASELINESIntegration
toki.coverageβ new"multiturn"category axis bucket (critical severity)python -m toki multiturn --strategy crescendo|echo_chamber --model safe|unsafe|crescendo --goal --max-turns --seed --success-threshold --output-dir [--json]toki.__init__exports +__version__β1.8.0;pyproject.tomlbumpedLive demo
The crescendo-vulnerable baseline holds for two benign turns (score 0.90) then capitulates on turn 2 (0.00 β JAILBROKEN) β exactly the failure mode single-turn testing misses.
Verification
raise NotImplementedError)ruff check/ruff formatclean on new files;cargo build+cargo testgreenπ€ Generated with Claude Code
https://claude.ai/code/session_01WRE1YLhT6aNP4GZT8zbw6q
Generated by Claude Code