
feat: add rule-based tool-output compressor for extraction pipeline#308

Merged
win4r merged 1 commit into CortexReach:master from AliceLJY:feat/tool-output-compressor
Mar 23, 2026

Conversation

@AliceLJY
Collaborator

Summary

  • Add src/tool-output-compressor.ts: rule-based pre-processor that compresses tool output noise in conversation text before it reaches the extraction LLM
  • Integrate into smart-extractor.ts alongside stripEnvelopeMetadata() — same pattern, minimal change (3 lines)
  • 11 new tests in test/tool-output-compressor.test.mjs, all passing

Motivation

AI coding agents (Claude Code, Codex, Gemini CLI) produce conversations full of tool output noise — git push boilerplate, passing test logs, base64 screenshots, large file dumps. When this text flows into extractAndPersist(), it:

  1. Wastes LLM tokens on worthless content (git push's "Enumerating objects: 5, done." has zero extraction value)
  2. Can confuse weaker extraction LLMs into treating boilerplate as memories
  3. Increases cost per extraction call

Design

Inspired by RTK (Rust Token Killer), which uses 50+ hardcoded rules to compress CLI outputs. We adapted the concept for the extraction pipeline:

  • Zero LLM overhead: pure regex/pattern matching
  • Never modifies user dialog or AI reasoning: only targets tool-output blocks (lines starting with prompt markers such as `$ ` or `> `)
  • Fail-safe: unrecognized outputs pass through unchanged
  • Sits alongside stripEnvelopeMetadata(): same integration pattern
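As a rough TypeScript sketch of the integration pattern (the body of `stripEnvelopeMetadata` and the other function names here are illustrative placeholders, not the actual `smart-extractor.ts` code):

```typescript
// Illustrative sketch only: stripEnvelopeMetadata's real body lives in
// smart-extractor.ts; this placeholder just shows the chaining pattern.
function stripEnvelopeMetadata(text: string): string {
  return text.replace(/^\[envelope:.*\]$/gm, "").trim();
}

// The new pass slots in directly after it. Unrecognized text passes
// through unchanged, so the worst case is a no-op (fail-safe).
function compressToolOutput(text: string): string {
  return text; // rules elided here for brevity
}

// The small integration point before extractAndPersist() sees the text.
function preprocess(conversation: string): string {
  return compressToolOutput(stripEnvelopeMetadata(conversation));
}
```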

Compression strategies

| Type | Strategy | Safety |
| --- | --- | --- |
| Base64 screenshots (>500 chars) | Replace with `[image: ~NKB base64]` placeholder | Lossless for extraction purposes |
| Git push/pull boilerplate | Compress to `[git: ok main]` | Failures preserved in full |
| Passing test logs | Compress to `[test: 15 passed]` | Failing tests preserved in full |
| Large unmatched outputs (>2K chars) | Head+tail truncation with `[...N chars truncated...]` | Context preserved at both ends |
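A minimal sketch of two of the rules above (the regexes, the 500-char threshold, and the function names are assumptions for illustration, not the exact rules in `tool-output-compressor.ts`):

```typescript
// Assumed rule: a long run of base64-looking characters is replaced by a
// size placeholder, so the extraction LLM never sees the raw pixels.
const BASE64_RUN = /[A-Za-z0-9+/]{500,}={0,2}/g;

function compressBase64(text: string): string {
  return text.replace(BASE64_RUN, (run) => {
    const kb = Math.round(run.length / 1024);
    return `[image: ~${kb}KB base64]`;
  });
}

// Assumed rule: collapse one example format of a fully-passing test summary.
// Any mention of a failure makes the rule decline to match (fail-safe).
function compressPassingTests(text: string): string {
  const m = text.match(/^Tests:\s+(\d+) passed, \1 total$/m);
  if (m && !/\bfail(ed|ing)?\b/i.test(text)) {
    return `[test: ${m[1]} passed]`;
  }
  return text; // unrecognized output passes through unchanged
}
```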

Benchmark (from downstream testing on 3,584 transcripts)

  • 28.6% size reduction on sampled transcripts
  • Primary savings from skipping streaming entries (945 entries skipped in 20-file sample)
  • Tool result summarization: 48 outputs compressed
  • Large output truncation: 12 outputs truncated

Test plan

  • 11 unit tests covering all compression rules + safety boundaries
  • Existing strip-envelope-metadata tests still pass (12/12)
  • Verified on real Claude Code transcripts downstream (RecallNest fork)

🤖 Generated with Claude Code

Add tool-output-compressor.ts that pre-processes conversation text
before LLM extraction to reduce token costs and improve quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
win4r merged commit f7e00af into CortexReach:master on Mar 23, 2026
5 checks passed