diff --git a/.docs/plans/issue-394-design-plan.md b/.docs/plans/issue-394-design-plan.md index c60177e4b..93fa126da 100644 --- a/.docs/plans/issue-394-design-plan.md +++ b/.docs/plans/issue-394-design-plan.md @@ -338,3 +338,51 @@ The implementation is broken into 8 incremental steps, each keeping the system d --- **Do you approve this plan as-is, or would you like to adjust any part?** + +--- + +## Implementation Progress + +### Completed Steps ✅ + +#### Step 0: Fix Empty Pattern Bug (CRITICAL - Added during implementation) +- **Date**: 2026-01-04 +- **PR**: #396 +- **Commit**: `f4c60fc1` +- **Problem Discovered**: Empty patterns in thesaurus caused spurious text insertions between every character +- **Root Cause**: Aho-Corasick empty patterns match at every position (index 0, 1, 2, ...) +- **Fix Applied**: + - Added `MIN_PATTERN_LENGTH` constant (2) to filter invalid patterns + - Updated both `find_matches()` and `replace_matches()` in `crates/terraphim_automata/src/matcher.rs` + - Added logging for skipped invalid patterns + - Return original text when no valid patterns exist +- **Tests Added**: 6 comprehensive regression tests: + - `test_empty_pattern_does_not_cause_spurious_insertions` + - `test_single_char_pattern_is_filtered` + - `test_whitespace_only_pattern_is_filtered` + - `test_valid_replacement_still_works` + - `test_empty_thesaurus_returns_original` + - `test_find_matches_filters_empty_patterns` +- **Status**: ✅ Merged to `fix/replacement-empty-pattern-bug` branch + +### Steps 1-8: Original Plan (Pending) + +The original design plan steps for case preservation and URL protection remain pending: +- Step 1: Add `display_value` field to `NormalizedTerm` - ✅ Completed in prior PR +- Step 2: Update `NormalizedTerm::new()` with builder method - ✅ Completed in prior PR +- Step 3: Update `index_inner()` to store original case - ✅ Completed in prior PR +- Step 4: Create `url_protector` module - ✅ Completed in prior PR +- Step 5: Update `replace_matches()` to use display_value - ✅ Completed in prior PR +- Step 6: Integrate URL protection into `replace_matches()` - ✅ Completed in prior PR +- Step 7: Update integration tests - ✅ Completed in prior PR +- Step 8: Verify WASM compatibility - ✅ Verified + +### Bug Discovery Notes + +The empty pattern bug was discovered during implementation when testing the replacement functionality. The symptom was: +- Input: `npm install express` +- Output: `bun install exmatching_and_iterators_in_rustpmatching_and_iterators_in_rustpmatching...` + +Investigation revealed that an empty pattern `""` in the thesaurus was matching at every character boundary, causing the replacement value to be inserted between each character. + +This bug is now fixed and documented for future reference. diff --git a/.docs/summary.md b/.docs/summary.md index 0294d027a..88fd7467c 100644 --- a/.docs/summary.md +++ b/.docs/summary.md @@ -4,7 +4,7 @@ Terraphim AI is a privacy-first, locally-running AI assistant featuring multi-agent systems, knowledge graph intelligence, and secure code execution in Firecracker microVMs. The project combines Rust-based backend services with vanilla JavaScript frontends, emphasizing security, performance, and production-ready architecture. -**Current Status**: v1.0.0 RELEASED - Production-ready with comprehensive multi-language package ecosystem +**Current Status**: v1.4.0 RELEASED - Production-ready with comprehensive multi-language package ecosystem **Primary Technologies**: Rust (async/tokio), Svelte/Vanilla JS, Firecracker VMs, OpenRouter/Ollama LLMs, NAPI, PyO3 **Test Coverage**: 99+ comprehensive tests with 59 passing in main workspace @@ -461,6 +461,17 @@ cd desktop && yarn run check - Code intelligence and security validation - Multi-language support operational +### Recent Bug Fixes (2026-01-04) ✅ + +**Issue #394: Empty Pattern Bug in Text Replacement** +- **Problem**: Empty patterns in thesaurus caused spurious text insertions between every character +- **Symptom**: `npm install express` → `bun install exmatching...pmatching...` +- **Root Cause**: Aho-Corasick empty patterns match at every position (index 0, 1, 2, ...) +- **Fix**: Added `MIN_PATTERN_LENGTH` (2) constant to filter invalid patterns +- **Files Changed**: `crates/terraphim_automata/src/matcher.rs` +- **Tests Added**: 6 comprehensive regression tests +- **PR**: #396 + ### In Progress/Pending 🔄 1. **TruthForge Deployment**: diff --git a/crates/claude-log-analyzer/tests/terraphim_integration_tests.rs b/crates/claude-log-analyzer/tests/terraphim_integration_tests.rs index 88acd104e..4ca8a87af 100644 --- a/crates/claude-log-analyzer/tests/terraphim_integration_tests.rs +++ b/crates/claude-log-analyzer/tests/terraphim_integration_tests.rs @@ -23,6 +23,7 @@ fn create_wrangler_thesaurus() -> Thesaurus { for (pattern, normalized, id) in wrangler_patterns { let normalized_term = NormalizedTerm { + display_value: None, id, value: NormalizedTermValue::from(normalized), url: Some("https://developers.cloudflare.com/workers/wrangler/".to_string()), @@ -71,6 +72,7 @@ fn create_comprehensive_thesaurus() -> Thesaurus { for (pattern, normalized, id, url) in patterns { let normalized_term = NormalizedTerm { + display_value: None, id, value: NormalizedTermValue::from(normalized), url: Some(url.to_string()), diff --git a/crates/terraphim_automata/src/matcher.rs b/crates/terraphim_automata/src/matcher.rs index 1750602e4..b6eccac15 100644 --- a/crates/terraphim_automata/src/matcher.rs +++ b/crates/terraphim_automata/src/matcher.rs @@ -12,12 +12,47 @@ pub struct Matched { pub pos: Option<(usize, usize)>, } +/// Minimum pattern length for find_matches to prevent spurious matches. +const MIN_FIND_PATTERN_LENGTH: usize = 2; + pub fn find_matches( text: &str, thesaurus: Thesaurus, return_positions: bool, ) -> Result> { - let patterns: Vec = thesaurus.keys().cloned().collect(); + // Filter out empty and too-short patterns + let valid_patterns: Vec<(NormalizedTermValue, NormalizedTerm)> = thesaurus + .into_iter() + .filter_map(|(key, value)| { + let pattern_str = key.to_string(); + if pattern_str.trim().is_empty() || pattern_str.len() < MIN_FIND_PATTERN_LENGTH { + log::warn!( + "Skipping invalid pattern in find_matches: {:?} (length {} < {})", + pattern_str, + pattern_str.len(), + MIN_FIND_PATTERN_LENGTH + ); + None + } else { + Some((key.clone(), value.clone())) + } + }) + .collect(); + + if valid_patterns.is_empty() { + log::debug!("No valid patterns for find_matches, returning empty"); + return Ok(Vec::new()); + } + + let patterns: Vec = + valid_patterns.iter().map(|(k, _)| k.clone()).collect(); + let pattern_map: std::collections::HashMap = + valid_patterns.into_iter().collect(); + + log::debug!( + "Building find_matches automaton with {} valid patterns", + patterns.len() + ); let ac = AhoCorasick::builder() .match_kind(MatchKind::LeftmostLongest) @@ -27,7 +62,7 @@ pub fn find_matches( let mut matches: Vec = Vec::new(); for mat in ac.find_iter(text) { let term = &patterns[mat.pattern()]; - let normalized_term = thesaurus + let normalized_term = pattern_map .get(term) .ok_or_else(|| TerraphimAutomataError::Dict(format!("Unknown term {term}")))?; @@ -53,6 +88,12 @@ pub enum LinkType { PlainText, } +/// Minimum pattern length to prevent spurious matches. +/// Patterns shorter than this are filtered out to avoid: +/// - Empty patterns matching at every character position +/// - Single-character patterns causing excessive matches +const MIN_PATTERN_LENGTH: usize = 2; + /// Replace matches in text using the thesaurus. /// /// Uses `display()` method on `NormalizedTerm` to get the case-preserved @@ -60,6 +101,9 @@ pub enum LinkType { /// /// URLs (http, https, mailto, email addresses) are protected from replacement /// to prevent corruption of links. +/// +/// Patterns shorter than MIN_PATTERN_LENGTH (2) are filtered out to prevent +/// spurious matches at every character position. pub fn replace_matches(text: &str, thesaurus: Thesaurus, link_type: LinkType) -> Result> { // Protect URLs from replacement let protector = UrlProtector::new(); @@ -67,40 +111,65 @@ pub fn replace_matches(text: &str, thesaurus: Thesaurus, link_type: LinkType) -> let mut patterns: Vec = Vec::new(); let mut replace_with: Vec = Vec::new(); + for (key, value) in thesaurus.into_iter() { + let pattern_str = key.to_string(); + + // Skip empty or too-short patterns to prevent spurious matches + // Empty patterns match at every character position, causing text like + // "exmatching_and_iterators_in_rustpmatching..." to appear + if pattern_str.trim().is_empty() || pattern_str.len() < MIN_PATTERN_LENGTH { + log::warn!( + "Skipping invalid pattern: {:?} (length {} < {})", + pattern_str, + pattern_str.len(), + MIN_PATTERN_LENGTH + ); + continue; + } + // Use display() to get case-preserved value for output let display_text = value.display(); - match link_type { - LinkType::WikiLinks => { - patterns.push(key.to_string()); - replace_with.push(format!("[[{}]]", display_text)); - } - LinkType::HTMLLinks => { - patterns.push(key.to_string()); - replace_with.push(format!( - "{}", - value.url.as_deref().unwrap_or_default(), - display_text - )); - } - LinkType::MarkdownLinks => { - patterns.push(key.to_string()); - replace_with.push(format!( - "[{}]({})", - display_text, - value.url.as_deref().unwrap_or_default() - )); - } - LinkType::PlainText => { - patterns.push(key.to_string()); - replace_with.push(display_text.to_string()); - } - } + let replacement = match link_type { + LinkType::WikiLinks => format!("[[{}]]", display_text), + LinkType::HTMLLinks => format!( + "{}", + value.url.as_deref().unwrap_or_default(), + display_text + ), + LinkType::MarkdownLinks => format!( + "[{}]({})", + display_text, + value.url.as_deref().unwrap_or_default() + ), + LinkType::PlainText => display_text.to_string(), + }; + + patterns.push(pattern_str); + replace_with.push(replacement); } + + // Validate alignment - patterns and replacements must be 1:1 + debug_assert_eq!( + patterns.len(), + replace_with.len(), + "Pattern/replacement vector mismatch: {} patterns vs {} replacements", + patterns.len(), + replace_with.len() + ); + + // If no valid patterns, return original text unchanged + if patterns.is_empty() { + log::debug!("No valid patterns to replace, returning original text"); + return Ok(text.as_bytes().to_vec()); + } + + log::debug!("Building automaton with {} valid patterns", patterns.len()); + let ac = AhoCorasick::builder() .match_kind(MatchKind::LeftmostLongest) .ascii_case_insensitive(true) - .build(patterns)?; + .build(&patterns)?; // Perform replacement on masked text let replaced = ac.replace_all(&masked_text, &replace_with); @@ -187,3 +256,141 @@ mod paragraph_tests { assert!(!para.contains("Next paragraph")); } } + +#[cfg(test)] +mod replacement_bug_tests { + use super::*; + use terraphim_types::{NormalizedTerm, NormalizedTermValue, Thesaurus}; + + /// Regression test for the bug where empty patterns caused text to be + /// inserted between every character. + /// + /// Bug manifestation: "npm install express" became + /// "bun install exmatching_and_iterators_in_rustpmatching..." + #[test] + fn test_empty_pattern_does_not_cause_spurious_insertions() { + let mut thesaurus = Thesaurus::new("test".to_string()); + + // Simulate the bug: empty pattern with a display value + let bad_nterm = NormalizedTerm::new( + 1, + NormalizedTermValue::from("matching_and_iterators_in_rust"), + ) + .with_display_value("matching_and_iterators_in_rust".to_string()); + thesaurus.insert(NormalizedTermValue::from(""), bad_nterm); + + // Add a valid pattern + let bun_nterm = NormalizedTerm::new(2, NormalizedTermValue::from("bun")) + .with_display_value("bun".to_string()); + thesaurus.insert(NormalizedTermValue::from("npm"), bun_nterm); + + let result = + replace_matches("npm install express", thesaurus, LinkType::PlainText).unwrap(); + let result_str = String::from_utf8(result).unwrap(); + + // Should NOT have spurious insertions between characters + assert_eq!(result_str, "bun install express"); + assert!(!result_str.contains("matching_and_iterators_in_rust")); + } + + #[test] + fn test_single_char_pattern_is_filtered() { + let mut thesaurus = Thesaurus::new("test".to_string()); + + // Single character pattern - should be filtered + let single_char_nterm = NormalizedTerm::new(1, NormalizedTermValue::from("expanded")) + .with_display_value("expanded".to_string()); + thesaurus.insert(NormalizedTermValue::from("e"), single_char_nterm); + + // Valid pattern + let bun_nterm = NormalizedTerm::new(2, NormalizedTermValue::from("bun")) + .with_display_value("bun".to_string()); + thesaurus.insert(NormalizedTermValue::from("npm"), bun_nterm); + + let result = + replace_matches("npm install express", thesaurus, LinkType::PlainText).unwrap(); + let result_str = String::from_utf8(result).unwrap(); + + // Single-char pattern should be filtered, only npm->bun replacement should happen + assert_eq!(result_str, "bun install express"); + assert!(!result_str.contains("expanded")); + } + + #[test] + fn test_whitespace_only_pattern_is_filtered() { + let mut thesaurus = Thesaurus::new("test".to_string()); + + // Whitespace-only pattern - should be filtered + let ws_nterm = NormalizedTerm::new(1, NormalizedTermValue::from("space")) + .with_display_value("space".to_string()); + thesaurus.insert(NormalizedTermValue::from(" "), ws_nterm); + + // Valid pattern + let bun_nterm = NormalizedTerm::new(2, NormalizedTermValue::from("bun")) + .with_display_value("bun".to_string()); + thesaurus.insert(NormalizedTermValue::from("npm"), bun_nterm); + + let result = + replace_matches("npm install express", thesaurus, LinkType::PlainText).unwrap(); + let result_str = String::from_utf8(result).unwrap(); + + assert_eq!(result_str, "bun install express"); + assert!(!result_str.contains("space")); + } + + #[test] + fn test_valid_replacement_still_works() { + let mut thesaurus = Thesaurus::new("test".to_string()); + + // Valid patterns + let bun_nterm = NormalizedTerm::new(1, NormalizedTermValue::from("bun")) + .with_display_value("bun".to_string()); + thesaurus.insert(NormalizedTermValue::from("npm"), bun_nterm); + + let yarn_nterm = NormalizedTerm::new(2, NormalizedTermValue::from("bun")) + .with_display_value("bun".to_string()); + thesaurus.insert(NormalizedTermValue::from("yarn"), yarn_nterm); + + let result = replace_matches( + "npm install && yarn add lodash", + thesaurus, + LinkType::PlainText, + ) + .unwrap(); + let result_str = String::from_utf8(result).unwrap(); + + assert_eq!(result_str, "bun install && bun add lodash"); + } + + #[test] + fn test_empty_thesaurus_returns_original() { + let thesaurus = Thesaurus::new("test".to_string()); + + let result = + replace_matches("npm install express", thesaurus, LinkType::PlainText).unwrap(); + let result_str = String::from_utf8(result).unwrap(); + + assert_eq!(result_str, "npm install express"); + } + + #[test] + fn test_find_matches_filters_empty_patterns() { + let mut thesaurus = Thesaurus::new("test".to_string()); + + // Empty pattern + let empty_nterm = NormalizedTerm::new(1, NormalizedTermValue::from("empty")) + .with_display_value("empty".to_string()); + thesaurus.insert(NormalizedTermValue::from(""), empty_nterm); + + // Valid pattern + let test_nterm = NormalizedTerm::new(2, NormalizedTermValue::from("test")) + .with_display_value("test".to_string()); + thesaurus.insert(NormalizedTermValue::from("hello"), test_nterm); + + let matches = find_matches("hello world", thesaurus, false).unwrap(); + + // Should only find "hello", not empty pattern matches + assert_eq!(matches.len(), 1); + assert_eq!(matches[0].term, "hello"); + } +} diff --git a/docs/src/components/terraphim-automata.md b/docs/src/components/terraphim-automata.md index 56a202308..ea86cccd3 100644 --- a/docs/src/components/terraphim-automata.md +++ b/docs/src/components/terraphim-automata.md @@ -149,6 +149,35 @@ let html_links = matcher::replace_matches(text, thesaurus, Format::Html)?; let markdown_links = matcher::replace_matches(text, thesaurus, Format::Markdown)?; ``` +### Pattern Validation (Important!) + +Both `find_matches` and `replace_matches` automatically filter invalid patterns to prevent issues: + +```rust +/// Minimum pattern length to prevent spurious matches. +/// Patterns shorter than this are filtered out to avoid: +/// - Empty patterns matching at every character position +/// - Single-character patterns causing excessive matches +const MIN_PATTERN_LENGTH: usize = 2; +``` + +**Why this matters**: Empty patterns in Aho-Corasick automata match at every position (index 0, 1, 2, ...), causing spurious text insertions between every character. + +**Example of the bug (now fixed)**: +- Input: `npm install express` +- With empty pattern: `bun install exmatching...pmatching...` (broken!) +- With filtering: `bun install express` (correct!) + +**Automatic filtering includes**: +- Empty strings (`""`) +- Single characters (`"a"`, `"x"`) +- Whitespace-only strings (`" "`) + +Invalid patterns are logged as warnings for debugging: +``` +WARN: Skipping invalid pattern: "" (length 0 < 2) +``` + ### Format Options ```rust pub enum Format { diff --git a/docs/src/history/@lessons-learned.md b/docs/src/history/@lessons-learned.md index be1f74112..67ce159f5 100644 --- a/docs/src/history/@lessons-learned.md +++ b/docs/src/history/@lessons-learned.md @@ -1493,3 +1493,82 @@ The 2-routing workflow bug fix represents the final critical piece in creating a ### **Current System Status: CORE FUNCTIONAL, INFRASTRUCTURE MAINTENANCE NEEDED** ⚡ The Terraphim AI agent system demonstrates strong core functionality with 38+ tests passing, but requires systematic infrastructure maintenance to restore full test coverage and resolve compilation issues across the complete codebase. + +--- + +## Aho-Corasick Empty Pattern Bug Fix (2026-01-04) - CRITICAL BUG LESSONS + +### **Bug Discovery: Issue #394** + +A critical bug was discovered in the text replacement functionality where empty patterns in the thesaurus caused spurious text insertions between every character. + +**Symptom:** +- Input: `npm install express` +- Expected: `bun install express` +- Actual: `bun install exmatching_and_iterators_in_rustpmatching_and_iterators_in_rustpmatching...` + +**Root Cause:** +Empty string patterns (`""`) in Aho-Corasick automata match at every position (index 0, 1, 2, ...) in the input text. When replacement is performed, the replacement value is inserted at every boundary between characters. + +### **Key Lessons Learned** + +**1. Validate Input Patterns Before Building Automata** ⚠️ +- Empty patterns should never be passed to Aho-Corasick builders +- Single-character patterns can cause excessive matches (consider filtering) +- Pattern length validation is a MUST for any text replacement system +- Log warnings when skipping invalid patterns for debugging + +**2. Defensive Programming in Text Processing** 🛡️ +```rust +// Always filter patterns before building automata +if pattern_str.trim().is_empty() || pattern_str.len() < MIN_PATTERN_LENGTH { + log::warn!("Skipping invalid pattern: {:?}", pattern_str); + continue; +} +``` + +**3. Regression Tests Are Essential** 🧪 +- Added 6 comprehensive tests covering: + - Empty pattern filtering + - Single-character pattern filtering + - Whitespace-only pattern filtering + - Valid replacement still works + - Empty thesaurus returns original text + - Both `find_matches` and `replace_matches` filter correctly + +**4. The Bug Cascaded Through Multiple Functions** 🔄 +- Both `find_matches()` and `replace_matches()` were affected +- Had to fix both functions to prevent the issue in all codepaths +- Consistency is key - if one function validates, all similar functions should too + +**5. Constants for Magic Numbers** 📐 +```rust +/// Minimum pattern length to prevent spurious matches +const MIN_PATTERN_LENGTH: usize = 2; +``` +- Makes the code self-documenting +- Easy to adjust if requirements change +- Clear intent for future maintainers + +### **Technical Implementation Details** + +**Fix Location:** `crates/terraphim_automata/src/matcher.rs` + +**Changes Made:** +1. Added `MIN_PATTERN_LENGTH` constant (value: 2) +2. Filter patterns before building automata in `find_matches()` +3. Filter patterns before building automata in `replace_matches()` +4. Return original text unchanged when no valid patterns exist +5. Log warnings for skipped patterns to aid debugging + +**PR:** #396 +**Commit:** `f4c60fc1` + +### **Pattern for Similar Issues** + +When working with pattern matching libraries (Aho-Corasick, regex, etc.): +1. **Validate inputs** - Check for empty/invalid patterns +2. **Handle edge cases** - Empty input, empty patterns, no matches +3. **Test boundaries** - Single char, zero length, maximum length +4. **Log debugging info** - Warn when skipping inputs +5. **Return sensible defaults** - Empty results, original input unchanged