Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ec9c20838
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
if end-start < m.minimumTokenLength {
	return matchContext{}
```
Handle short timestamp prefixes in token graph matching
matchProbability enforces end-start >= minimumTokenLength, but end-start counts token transitions, not tokens, so the gate effectively requires one extra token and makes short known prefixes impossible to match. This commit explicitly adds "11:42:35.173" and "11:42:35,173" to pkg/multiline/detector.go, yet those prefixes tokenize to only 7 tokens (6 transitions), so they never satisfy the gate. As a result, IsNewEntry stays false for logs that start with time-only timestamps, and those logs are over-merged into large multiline entries instead of preserving entry boundaries.
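The off-by-one can be illustrated in a few lines. This is a hedged sketch of the arithmetic only: the tokenization of "11:42:35.173" shown here is illustrative, not the actual detector's, and `minimumTokenLength` is assumed to be 7 for the example.

```go
package main

import "fmt"

func main() {
	// "11:42:35.173" -> D : D : D . D  (7 tokens, 6 transitions)
	tokens := []string{"D", ":", "D", ":", "D", ".", "D"}
	const minimumTokenLength = 7

	start, end := 0, len(tokens)-1 // indices into the token sequence
	transitions := end - start     // counts edges between tokens: 6

	fmt.Println(transitions < minimumTokenLength)  // true: the gate rejects the prefix
	fmt.Println(len(tokens) >= minimumTokenLength) // true: counting tokens would accept it
}
```

Comparing a transition count against a token-count threshold is the mismatch: either the gate should compare `end-start+1`, or the threshold should be expressed in transitions.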
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Missing schema migration for existing databases
- Added ALTER TABLE statement with IF NOT EXISTS to properly add the end_line_number column to existing databases that were created before the multiline feature.
Or push these changes by commenting:
@cursor push 31a71f14f4
Preview (31a71f14f4)
```diff
diff --git a/pkg/store/duckdb.go b/pkg/store/duckdb.go
--- a/pkg/store/duckdb.go
+++ b/pkg/store/duckdb.go
@@ -57,6 +57,11 @@
 		}
 	}
+	_, err = s.db.Exec(`ALTER TABLE log_entries ADD COLUMN IF NOT EXISTS end_line_number INTEGER`)
+	if err != nil {
+		return fmt.Errorf("add end_line_number column: %w", err)
+	}
+
 	return nil
 }
```

```go
		return fmt.Errorf("create table: %w", err)
	}
}
```
Missing schema migration for existing databases
Medium Severity
The Init() comment claims it "migrates older schemas by adding missing columns," but the implementation only uses CREATE TABLE IF NOT EXISTS, which does nothing when the table already exists. If a user runs against an existing database created before this change, the end_line_number column won't be added, and all subsequent INSERT and SELECT statements referencing end_line_number will fail with SQL errors.
Add a multiline merger that detects log entry boundaries using Datadog-style timestamp auto-detection (tokenizer + token graph), so stack traces and other multiline entries are combined before parsing. Detection code adapted from Datadog Agent (Apache 2.0, see NOTICE). Merger and CLI wiring are self-written. Pipeline change: Ingestor → multiline.Merge() → parse loop → Store
Covers the full path: ingestor → multiline.Merge → parser chain → DuckDB store → query verification. Replaces the manual test steps in the PR.
…, and endLine tracking
- Fall back to line-by-line output when no timestamp is ever detected, preventing silent concatenation of unrelated log lines
- Recompute bufBytes from current line after overflow flush instead of carrying stale accumulated size
- Check overflow and flush before updating endLine to avoid overlapping line ranges in emitted entries
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```go
}

	return true
}
```
Dead production code only used in tests
Low Severity
Both tokensToString and isMatch are defined in tokenizer.go (production code) but are never called from any production code path — the only callers are in tokenizer_test.go. Defining test helpers in non-test files means they are compiled into the production binary unnecessarily and can mislead developers into thinking these functions are part of the active logic. In Go, helpers needed only by tests are conventionally placed in _test.go files so they are excluded from production builds.
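The convention the review points to: files ending in `_test.go` are compiled only by `go test` and excluded from `go build`, so test-only helpers defined there never reach the production binary. A hedged sketch of what the relocated helpers might look like (in the real repo this would be `pkg/multiline/tokenizer_test.go` in package multiline; the bodies below are assumptions, only the names come from the review):

```go
package main

import (
	"fmt"
	"strings"
)

// tokensToString renders a token slice for test assertions (body assumed).
// Moved into a _test.go file, it would be dropped from production builds.
func tokensToString(tokens []string) string {
	return strings.Join(tokens, " ")
}

// isMatch reports whether two token sequences are identical (body assumed).
func isMatch(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(tokensToString([]string{"D", ":", "D"})) // D : D
	fmt.Println(isMatch([]string{"D"}, []string{"D"}))   // true
}
```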



Summary
- New `pkg/multiline` package with Datadog-style timestamp auto-detection to identify log entry boundaries (tokenizer → token graph → probability matching against 70+ known timestamp formats)
- The `ingest`, `analyze`, and `debug workspace` flows run logs through the merger
- New `EndLineNumber` field in the store schema to track multiline entry spans

Test plan
- Unit tests (`go test ./pkg/...`)
- Build (`go build ./cmd/lapp/`)
- Manual ingest: `go run ./cmd/lapp/ ingest pkg/multiline/testdata/java_stacktrace.log --db /tmp/test.duckdb`

Note
Medium Risk
Moderate risk because it changes how logs are chunked and persisted (multiline merging + DB schema change), which can affect downstream parsing/pattern discovery and any existing DuckDB databases or queries.
Overview
Adds first-class multiline log entry detection and merging so stack traces/tracebacks/panic dumps are treated as a single logical entry instead of many lines.
All three CLI flows (`ingest`, `analyze`, `debug workspace`) now run logs through the new `pkg/multiline` detector + merger before parsing/analysis, and ingestion persists merged entries with a new `end_line_number` column / `EndLineNumber` field to record the original span. The PR also includes extensive unit/integration tests plus a `NOTICE` file documenting Apache-licensed Datadog-derived detector components.

Written by Cursor Bugbot for commit 87d79b6. This will update automatically on new commits.