
Project Vision: AI-Native Log Pattern Discovery in the Post-Dashboard Era #2

@STRRL


Context: Why Logs, Why Now

Two tectonic shifts are reshaping how engineers understand production systems:

  1. LLMs can now process unstructured text at scale — what was previously logs' greatest weakness (no schema, no structure) is now a strength. AI excels at extracting meaning from natural language.
  2. The "Logs Are All You Need" thesis is gaining traction — the idea that logs are a superset of metrics and traces (metrics = aggregated events, traces = collections of start/end events, logs = raw events) is being arrived at independently by multiple sources.

Intellectual Lineage

| Source | Core Argument |
| --- | --- |
| Honeycomb / Charity Majors — "Observability 2.0" | Build on "arbitrarily-wide structured log events" as a single source of truth. "The bridge from Observability 1.0 to 2.0 is made up of logs, not metrics." |
| Ivan Burmistrov — "All you need is Wide Events" | Traces, logs, and metrics are all special cases of "wide events." Unify on this primitive instead of maintaining three separate pipelines. |
| Sazabi Manifesto — "Logs Are All You Need" | "With logs, you can reconstruct metrics and traces, giving you three 'pillars' for the price of one." AI makes this viable now. Monitoring is dead; agentic anomaly detection is the future. |
| GreptimeDB | "Metrics, logs, and traces aren't separate storage systems but different query views of the same underlying data." |

What LAPP Is

LAPP (Log Auto Pattern Pipeline) is a CLI tool that automatically discovers log patterns, semantifies them with LLMs, and provides structured viewing of log streams.

Core Design Principles

  1. Cluster first, LLM second — Drain/Grok/JSON parsers discover templates cheaply (no API cost). The LLM is used once per session to semantify templates, not once per log line.
  2. Progressive, not perfect — Discover patterns, inspect leftover, iterate. Zero information loss.
  3. Composable — Unix-pipe friendly. Reads stdin, outputs structured results. Fits into existing workflows.
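The first-match-wins parser chain behind principle 1 can be sketched in Go. The `Parser` interface, `Chain` type, and the toy `jsonParser` below are illustrative stand-ins, not LAPP's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// Parser is the interface each strategy (JSON, Grok, Drain) would implement.
type Parser interface {
	// Parse returns a template ID and ok=true when the line matches.
	Parse(line string) (templateID string, ok bool)
}

// Chain tries parsers in order; the first match wins.
type Chain struct{ parsers []Parser }

func (c *Chain) Parse(line string) (string, bool) {
	for _, p := range c.parsers {
		if id, ok := p.Parse(line); ok {
			return id, ok
		}
	}
	return "", false // leftover: no parser matched
}

// jsonParser is a toy stand-in: it matches lines that look like JSON objects.
type jsonParser struct{}

func (jsonParser) Parse(line string) (string, bool) {
	s := strings.TrimSpace(line)
	if strings.HasPrefix(s, "{") && strings.HasSuffix(s, "}") {
		return "json", true
	}
	return "", false
}

func main() {
	chain := &Chain{parsers: []Parser{jsonParser{}}}
	id, ok := chain.Parse(`{"level":"info","msg":"ready"}`)
	fmt.Println(id, ok) // json true
}
```

Lines that fall through every parser land in the leftover bucket, which principle 2 (progressive, not perfect) later revisits.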

Architecture (Current)

stdin / log file
  │
  ▼
Ingestor (streaming LogLine channel)
  │
  ▼
Parser Chain (first match wins)
  ├─ JSONParser   → detects JSON objects, extracts message/keys
  ├─ GrokParser   → SYSLOG, Apache common/combined patterns
  ├─ DrainParser  → online clustering via go-drain3
  └─ (LLM stub)   → placeholder for future direct LLM parsing
  │
  ▼
DuckDB Store
  ├─ log_entries (line_number, raw, template_id, template)
  └─ (planned) semantic_labels (template_id, semantic_id, description)
  │
  ├──▶ Query CLI (templates, query by template/time)
  ├──▶ Analyzer Agent (eino ADK + OpenRouter, workspace-based)
  └──▶ (planned) Log Viewer (color-coded templates, leftover)

This follows the IBM "Label Broadcasting" pattern: cluster first (90%+ volume reduction), apply the LLM to cluster representatives only, then broadcast the labels back to every member line. Result: 99.7% reduction in inference cost compared to per-line LLM calls.
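A minimal sketch of the broadcast step, assuming the single per-session LLM call has already produced one label per template (the `labelOf` map stands in for that call's output; all names are illustrative):

```go
package main

import "fmt"

// Broadcast applies one semantic label per template to every line in that
// cluster, instead of labeling each line individually.
func Broadcast(lineTemplates []string, labelOf map[string]string) []string {
	labels := make([]string, len(lineTemplates))
	for i, t := range lineTemplates {
		labels[i] = labelOf[t] // cluster label fans out to all members
	}
	return labels
}

func main() {
	// 6 lines but only 2 templates: the LLM is consulted twice, not six times.
	lines := []string{"T1", "T1", "T2", "T1", "T2", "T1"}
	labels := map[string]string{
		"T1": "server-startup",
		"T2": "connection-timeout",
	}
	fmt.Println(Broadcast(lines, labels))
}
```

The cost reduction comes entirely from the fan-out: inference scales with the number of templates, not the number of lines.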


Progress

✅ Phase 0: Parser Pipeline (Done)

  • Multi-strategy parser chain: JSON → Grok → Drain (first match wins)
  • Streaming ingestor (file or stdin)
  • DuckDB storage with batch insert
  • Query layer: template summaries, filter by template ID / time range
  • CLI: ingest, templates, query
  • Unit tests for all packages
  • Integration tests across all 14 Loghub-2.0 datasets

✅ Phase 0.5: Agentic Analyzer (Done)

  • Workspace builder: generates summary.txt + errors.txt from parsed logs
  • AI agent via cloudwego/eino ADK + OpenRouter
  • Filesystem tools (grep, read_file, execute) for agent investigation
  • CLI: analyze, debug workspace, debug run

🔲 Phase 1: LLM Semantic Labeling (Next)

Give every discovered template a human-readable semantic ID and description.

  • Add semantic_labels table in DuckDB (template_id → semantic_id, description)
  • After ingest, collect all templates + sample lines
  • Single LLM call: input templates → output semantic IDs and descriptions
  • Store labels in DuckDB, enrich query output
  • Update templates CLI to show semantic info
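The single labeling call could be handled as below, assuming the LLM is asked to return one JSON array covering all templates. The `SemanticLabel` struct mirrors the planned `semantic_labels` columns, but the field names and `ParseLabels` helper are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SemanticLabel mirrors a planned semantic_labels row
// (template_id -> semantic_id, description).
type SemanticLabel struct {
	TemplateID  string `json:"template_id"`
	SemanticID  string `json:"semantic_id"`
	Description string `json:"description"`
}

// ParseLabels decodes the LLM's one-shot response for the whole template set.
func ParseLabels(resp string) ([]SemanticLabel, error) {
	var labels []SemanticLabel
	err := json.Unmarshal([]byte(resp), &labels)
	return labels, err
}

func main() {
	// The real response would come from a single LLM call; stubbed here.
	resp := `[
	  {"template_id":"D1","semantic_id":"server-startup","description":"Server starting on a port"},
	  {"template_id":"D2","semantic_id":"connection-timeout","description":"DB connection timeout"}
	]`
	labels, err := ParseLabels(resp)
	if err != nil {
		panic(err)
	}
	for _, l := range labels {
		fmt.Printf("%s -> %s: %s\n", l.TemplateID, l.SemanticID, l.Description)
	}
}
```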

Before: `D1: "Starting <*> on port <*>"` / `D2: "Connection to <*> timed out after <*> ms"`

After: `server-startup: "Server starting on a port"` / `connection-timeout: "DB connection timeout"`

🔲 Phase 2: Log Viewer

Color-coded log viewer with template filtering and leftover highlighting.

  • Each semantic template gets a distinct color
  • Filter by template (show only "connection-timeout" logs)
  • Unmatched/leftover logs shown in gray
  • TUI (bubbletea/lipgloss) or simple web UI

🔲 Phase 3: Iterative Refinement

  • Inspect leftover (unmatched) bucket
  • Re-run discovery on leftover lines
  • Add new patterns without losing existing ones
  • Progressive refinement loop
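One iteration of that loop can be sketched as a pass over the leftover bucket, with `discover` standing in for a fresh Drain run; all names here are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// Refine runs one more discovery pass over the leftover bucket.
// Newly discovered templates are added alongside existing ones (never
// replacing them); lines that still match nothing stay in leftover.
func Refine(templates map[string]bool, leftover []string,
	discover func(string) (string, bool)) (map[string]bool, []string) {
	var still []string
	for _, line := range leftover {
		if t, ok := discover(line); ok {
			templates[t] = true // add without losing existing templates
		} else {
			still = append(still, line)
		}
	}
	return templates, still
}

func main() {
	known := map[string]bool{"Starting <*> on port <*>": true}
	leftover := []string{"cache miss for key user:42", "???binary???"}
	// Toy discovery: templatize lines mentioning "cache miss".
	discover := func(line string) (string, bool) {
		if strings.HasPrefix(line, "cache miss for key ") {
			return "cache miss for key <*>", true
		}
		return "", false
	}
	known, leftover = Refine(known, leftover, discover)
	fmt.Println(len(known), leftover) // 2 [???binary???]
}
```

Repeating the call until the leftover bucket stops shrinking gives the progressive refinement loop with zero information loss: unmatched lines are carried forward, never dropped.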

🔲 Phase 4: Substream Intelligence

  • Per-template rate/volume statistics over time
  • Trend detection (is this error pattern increasing?)
  • Anomaly detection (volume spike detection)
  • Time-window queries ("what happened around 14:30?")

🔲 Phase 5: Pipeline & Integration

  • Real-time streaming
  • Pipeline-as-config (YAML/JSON)
  • Integration with log sources (k8s, Docker, journald)
  • Output to structured formats (JSONL, OpenTelemetry)
  • MCP server for LLM agent access

Tech Stack

| Component | Technology |
| --- | --- |
| Language | Go |
| CLI | cobra |
| Log Parsing | go-drain3, trivago/grok, JSON detection |
| Storage | DuckDB (via duckdb-go/v2) |
| LLM Agent | cloudwego/eino ADK + OpenRouter |
| LLM Model | google/gemini-3-flash-preview (configurable) |

Evolution from Original Prototype

The original prototype (v0.1) used Bun + Vercel AI SDK + Zod with an "LLM generates regex" approach. The current implementation (v0.2+) switched to Go with a "cluster first, LLM second" approach:

| Aspect | v0.1 (prototype) | v0.2+ (current) |
| --- | --- | --- |
| Pattern discovery | LLM generates regex directly | Drain/Grok/JSON clustering (free, fast) |
| LLM role | Generate regex patterns | Semantify discovered templates |
| LLM cost | Per-discovery session, all lines sampled | One call per session, templates only |
| Storage | JSON files | DuckDB (queryable, scalable) |
| Runtime | Bun (TypeScript) | Go |

The key insight from v0.1 — "use LLM once, then execute cheaply at scale" — is preserved. The LLM's role shifted from "generate regex" to "label templates", which is more cost-effective and reliable.


Differentiation

vs. Classical Log Parsers (Drain3, Logram)

Drain3 produces cryptic templates like `User <*> logged in from <*>`. LAPP adds an LLM semantification layer: each template gets a human-readable ID and description. Drain does the heavy lifting; the LLM provides the understanding.

vs. Commercial Tools (Datadog, Splunk, Elastic)

| Dimension | Commercial Tools | LAPP |
| --- | --- | --- |
| Cost | $0.10–$1.80+/GB | Free, open source |
| Data residency | Logs sent to vendor cloud | Logs stay local |
| Lock-in | Patterns locked to vendor | DuckDB + portable labels |
| Setup | Full platform deployment | `cat logs.txt \| lapp ingest -` |

vs. k8sgpt

k8sgpt diagnoses Kubernetes resource issues. LAPP is log-format agnostic — works with any text log from any source.

