
feat: add rule-based tool-output compressor for extraction pipeline#308

Merged
win4r merged 1 commit into CortexReach:master from AliceLJY:feat/tool-output-compressor
Mar 23, 2026

Conversation

@AliceLJY
Collaborator

Summary

  • Add src/tool-output-compressor.ts: rule-based pre-processor that compresses tool output noise in conversation text before it reaches the extraction LLM
  • Integrate into smart-extractor.ts alongside stripEnvelopeMetadata() — same pattern, minimal change (3 lines)
  • 11 new tests in test/tool-output-compressor.test.mjs, all passing

Motivation

AI coding agents (Claude Code, Codex, Gemini CLI) produce conversations full of tool output noise — git push boilerplate, passing test logs, base64 screenshots, large file dumps. When this text flows into extractAndPersist(), it:

  1. Wastes LLM tokens on worthless content (git push's "Enumerating objects: 5, done." has zero extraction value)
  2. Can confuse weaker extraction LLMs into treating boilerplate as memories
  3. Increases cost per extraction call

Design

Inspired by RTK (Rust Token Killer), which uses 50+ hardcoded rules to compress CLI outputs. We adapted the concept for the extraction pipeline:

  • Zero LLM overhead: pure regex/pattern matching
  • Never modifies user dialog or AI reasoning: only targets tool-output blocks (lines starting with prompt markers such as `$ ` or `> `)
  • Fail-safe: unrecognized outputs pass through unchanged
  • Sits alongside stripEnvelopeMetadata(): same integration pattern
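As a rough TypeScript sketch of the integration pattern (the body of `stripEnvelopeMetadata` and the other function names here are illustrative placeholders, not the actual `smart-extractor.ts` code):

```typescript
// Illustrative sketch only: stripEnvelopeMetadata's real body lives in
// smart-extractor.ts; this placeholder just shows the chaining pattern.
function stripEnvelopeMetadata(text: string): string {
  return text.replace(/^\[envelope:.*\]$/gm, "").trim();
}

// The new pass slots in directly after it. Unrecognized text passes
// through unchanged, so the worst case is a no-op (fail-safe).
function compressToolOutput(text: string): string {
  return text; // rules elided here for brevity
}

// The small integration point before extractAndPersist() sees the text.
function preprocess(conversation: string): string {
  return compressToolOutput(stripEnvelopeMetadata(conversation));
}
```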

Compression strategies

| Type | Strategy | Safety |
| --- | --- | --- |
| Base64 screenshots (>500 chars) | Replace with `[image: ~NKB base64]` placeholder | Lossless for extraction purposes |
| Git push/pull boilerplate | Compress to `[git: ok main]` | Failures preserved in full |
| Passing test logs | Compress to `[test: 15 passed]` | Failing tests preserved in full |
| Large unmatched outputs (>2K chars) | Head+tail truncation with `[...N chars truncated...]` | Context preserved at both ends |
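A minimal sketch of two of the rules above (the regexes, the 500-char threshold, and the function names are assumptions for illustration, not the exact rules in `tool-output-compressor.ts`):

```typescript
// Assumed rule: a long run of base64-looking characters is replaced by a
// size placeholder, so the extraction LLM never sees the raw pixels.
const BASE64_RUN = /[A-Za-z0-9+/]{500,}={0,2}/g;

function compressBase64(text: string): string {
  return text.replace(BASE64_RUN, (run) => {
    const kb = Math.round(run.length / 1024);
    return `[image: ~${kb}KB base64]`;
  });
}

// Assumed rule: collapse one example format of a fully-passing test summary.
// Any mention of a failure makes the rule decline to match (fail-safe).
function compressPassingTests(text: string): string {
  const m = text.match(/^Tests:\s+(\d+) passed, \1 total$/m);
  if (m && !/\bfail(ed|ing)?\b/i.test(text)) {
    return `[test: ${m[1]} passed]`;
  }
  return text; // unrecognized output passes through unchanged
}
```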

Benchmark (from downstream testing on 3,584 transcripts)

  • 28.6% size reduction on sampled transcripts
  • Primary savings from skipping streaming entries (945 entries skipped in 20-file sample)
  • Tool result summarization: 48 outputs compressed
  • Large output truncation: 12 outputs truncated

Test plan

  • 11 unit tests covering all compression rules + safety boundaries
  • Existing strip-envelope-metadata tests still pass (12/12)
  • Verified on real Claude Code transcripts downstream (RecallNest fork)

🤖 Generated with Claude Code

Add tool-output-compressor.ts that pre-processes conversation text
before LLM extraction to reduce token costs and improve quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
win4r merged commit f7e00af into CortexReach:master on Mar 23, 2026
5 checks passed