Skip to content

feat(multiturn): Sprint 18 β€” multi-turn jailbreak engine (Crescendo / Echo Chamber) (v1.8.0)#10

Merged
wesleyscholl merged 1 commit into
mainfrom
claude/konjo-toki-lkvusj
Jun 19, 2026
Merged

feat(multiturn): Sprint 18 β€” multi-turn jailbreak engine (Crescendo / Echo Chamber) (v1.8.0)#10
wesleyscholl merged 1 commit into
mainfrom
claude/konjo-toki-lkvusj

Conversation

@konjoinfinity

Copy link
Copy Markdown
Contributor

Sprint 18 β€” Multi-Turn Jailbreak Engine (v1.8.0)

Motivation

Single-turn safety defenses do not transfer to multi-turn attacks. Crescendo (arXiv 2404.01833) reaches 98–100% ASR on frontier models by escalating a benign conversation across turns, each message referencing the model's prior replies. Echo Chamber (2601.05742), GRAF (2506.17881), and AutoAdv (2507.01020) confirm multi-turn is the dominant 2026 attack vector.

Every prior toki module (generator, mutator, indirect, agentic, evaluator) operated on a single prompt β†’ single response. This was the largest blind spot in the coverage map and a prerequisite for the backlog P3-1 dual-agent red-team loop.

What's new β€” toki.multiturn (zero external deps)

  • MultiTurnStrategy enum Β· Turn / Conversation dataclasses (chat-format helpers + scored transcript)
  • MultiTurnConfig / MultiTurnResult β€” save() (timestamped, no overwrite) / load() / to_json()
  • Strategy base + CrescendoStrategy / EchoChamberStrategy β€” deterministic opener β†’ escalation ladder β†’ payload, exactly n_turns messages
  • MultiTurnRunner.run(model_fn) β€” drives a chat-style Callable[[list[dict]], str], scores each reply with the real RuleScorer, stops early on first compliance (Crescendo behaviour)
  • Conversational baselines: safe, unsafe, crescendo (capitulates after benign history builds) + CONV_BASELINES

Integration

  • toki.coverage β€” new "multiturn" category axis bucket (critical severity)
  • CLI: python -m toki multiturn --strategy crescendo|echo_chamber --model safe|unsafe|crescendo --goal --max-turns --seed --success-threshold --output-dir [--json]
  • toki.__init__ exports + __version__ β†’ 1.8.0; pyproject.toml bumped

Live demo

The crescendo-vulnerable baseline holds for two benign turns (score 0.90) then capitulates on turn 2 (0.00 β†’ JAILBROKEN) β€” exactly the failure mode single-turn testing misses.

Verification

  • 675/675 Python tests passing (644 prior + 31 new)
  • New module 99% covered (only the abstract raise NotImplementedError)
  • ruff check / ruff format clean on new files; cargo build + cargo test green
  • PLAN.md + CHANGELOG.md updated

πŸ€– Generated with Claude Code

https://claude.ai/code/session_01WRE1YLhT6aNP4GZT8zbw6q


Generated by Claude Code

… Echo Chamber) (v1.8.0)

Single-turn safety defenses do not transfer to multi-turn attacks. Crescendo
(arXiv 2404.01833) reaches 98-100% ASR by escalating a benign conversation
across turns. Every prior toki module operated on a single prompt β†’ single
response; this closes the largest blind spot in the coverage map and unblocks
the P3-1 dual-agent loop.

- toki.multiturn: Turn/Conversation/MultiTurnConfig/MultiTurnResult dataclasses,
  CrescendoStrategy + EchoChamberStrategy deterministic escalation planners,
  MultiTurnRunner driving a chat-style model_fn with per-turn RuleScorer scoring
  and early-exit on first compliance; safe/unsafe/crescendo conversational
  baselines + CONV_BASELINES registry
- toki.coverage: new "multiturn" category axis bucket (critical severity)
- CLI: python -m toki multiturn --strategy --model --goal --max-turns ...
- toki.__init__ exports; version 1.7.0 β†’ 1.8.0; pyproject bumped
- 31 new tests (28 module + 3 CLI); 675/675 passing; new module 99% covered

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WRE1YLhT6aNP4GZT8zbw6q
@wesleyscholl wesleyscholl marked this pull request as ready for review June 19, 2026 16:57
@wesleyscholl wesleyscholl merged commit b792454 into main Jun 19, 2026
7 checks passed
@wesleyscholl wesleyscholl deleted the claude/konjo-toki-lkvusj branch June 19, 2026 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants