feat: Agent Improvement Engine (insights, improve, evolve) #765
alishakawaguchi wants to merge 9 commits into main
Adds EvolveSettings struct, EvolveConfig field on EntireSettings, and GetEvolveConfig() convenience method with defaults (SessionThreshold=5). Includes mergeJSON support and unit tests covering nil/zero/explicit cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Entire-Checkpoint: a9304e803fa1
Create cmd/entire/cli/llmcli with Runner.Execute(), StripGitEnv(), and ExtractJSONFromMarkdown() so the upcoming improve/generator.go can reuse the same CLI invocation logic without duplicating it from summarize. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Entire-Checkpoint: bb2d026600c0
Move terminal color detection, width calculation, token formatting, and lipgloss style construction into a new exported cmd/entire/cli/termstyle package so upcoming renderers (insights, improve) can reuse them without duplicating the logic or importing the cli package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Entire-Checkpoint: 04021bde7725
After summary generation in CondenseSession(), compute a SessionScore using pure math via insights.ScoreSession/ComputeOverall and return it in CondenseResult. No AI call, no latency impact (<1ms). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Entire-Checkpoint: a7a24df28aa9
…gestion generator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the `entire insights` command that reads session quality data from a SQLite cache backed by the entire/checkpoints/v1 branch, then renders scores, trends, and agent comparisons in the terminal or as JSON. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add the evolve package with three responsibilities: threshold-based triggering (ShouldTrigger/IncrementSessionCount/RecordRun), in-memory suggestion lifecycle tracking (Tracker with Accept/Reject/MeasureImpact), and user-facing notification (CheckAndNotify) when the session count meets the configured threshold. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Entire-Checkpoint: a67b9372b855
Adds `entire improve` — a two-phase pipeline that queries recurring friction from the SQLite insights cache, deep-reads transcripts for evidence, detects context files, and calls Claude to generate unified diff suggestions. Also adds JSON tags to insightsdb.FrictionTheme. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Entire-Checkpoint: 64a14f6741b8
…engine - Add context.Context to all insightsdb methods (noctx) - Wrap external package errors (wrapcheck) - Add nolint for tx.Rollback errcheck and maintidx on CondenseSession - Fix nilerr in cache refresh, unparam in renderInsightsTerminal - Add insightsdb cache/db files, insights scoring package - Run go mod tidy for modernc.org/sqlite dependency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: b5a4d26c47f7
Cursor Bugbot has reviewed your changes and found 3 potential issues.
| summaries := sessionRowsToSummaries(rows) | ||
| analysis := improve.AnalyzePatterns(summaries) | ||
| // Overlay the transcript excerpts we fetched into the analysis. | ||
| applyExcerpts(analysis.RepeatedFriction, patterns) |
Theme mismatch prevents transcript excerpts from being applied
High Severity
applyExcerpts matches dst and src patterns by Theme, but the two slices use incompatible theme formats. buildFrictionPatterns sets Theme to the raw friction text from the database (e.g., "Lint errors not caught by agent"), while AnalyzePatterns sets Theme to a classified keyword (e.g., "lint"). The map lookup in applyExcerpts will never find a match, so transcript excerpts collected in Phase 2 are silently discarded and never included in the prompt sent to Claude.
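The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `FrictionPattern` is a simplified stand-in, and `classifyTheme` and `applyExcerptsByKey` are hypothetical helpers. It assumes the fix is to normalize the raw friction text to the same key that the analysis side uses before doing the map lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// FrictionPattern is a simplified stand-in for the type used in the PR.
type FrictionPattern struct {
	Theme             string
	TranscriptExcerpt string
}

// classifyTheme is a hypothetical classifier that maps raw friction text
// to a short keyword, standing in for whatever AnalyzePatterns does.
func classifyTheme(raw string) string {
	lower := strings.ToLower(raw)
	for _, kw := range []string{"lint", "test", "build"} {
		if strings.Contains(lower, kw) {
			return kw
		}
	}
	return "other"
}

// applyExcerptsByKey copies excerpts from src into dst. The key function
// normalizes src themes so both slices are matched on the same key.
// It returns how many dst entries received an excerpt.
func applyExcerptsByKey(dst, src []FrictionPattern, key func(string) string) int {
	byKey := make(map[string]string, len(src))
	for _, p := range src {
		if p.TranscriptExcerpt != "" {
			byKey[key(p.Theme)] = p.TranscriptExcerpt
		}
	}
	applied := 0
	for i := range dst {
		if ex, ok := byKey[dst[i].Theme]; ok {
			dst[i].TranscriptExcerpt = ex
			applied++
		}
	}
	return applied
}

func main() {
	src := []FrictionPattern{{Theme: "Lint errors not caught by agent", TranscriptExcerpt: "golangci-lint failed"}}
	dst := []FrictionPattern{{Theme: "lint"}}

	identity := func(s string) string { return s }
	fmt.Println(applyExcerptsByKey(dst, src, identity))      // 0: raw text never equals "lint"
	fmt.Println(applyExcerptsByKey(dst, src, classifyTheme)) // 1: both sides share the key
}
```

Counting applied excerpts, as this sketch does, also gives the caller a cheap way to notice when nothing matched instead of discarding the excerpts silently.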
| return nil | ||
| } | ||
| return f | ||
| } |
nullableFloat stores legitimate zero scores as NULL
Medium Severity
nullableFloat converts any zero float to SQL NULL, but the scoring functions (scoreFriction, scoreFocus, scoreFirstPassSuccess) can legitimately produce 0.0 for poor-performing sessions (e.g., 5+ friction items yields ScoreFriction = 0). This conflates "score not yet computed" with "worst possible score" at the database level, making it impossible to distinguish the two states in queries.
| total += totalTokensFromUsage(tu.SubagentTokens) | ||
| } | ||
| return total | ||
| } |
Duplicate token summation function added in strategy package
Low Severity
totalTokensFromUsage is functionally identical to termstyle.TotalTokens (which was extracted from the old totalTokens in status_style.go in this same PR). Both recursively sum InputTokens + CacheCreationTokens + CacheReadTokens + OutputTokens including subagent tokens. The new function duplicates existing logic rather than reusing it.
Pull request overview
Adds an “Agent Improvement Engine” to Entire CLI: new insights and improve commands powered by a local SQLite cache, plus an opt-in “evolve” loop to nudge users to run improvements after N sessions.
Changes:
- Introduces `entire insights` (session scoring, trends, agent comparisons) backed by a SQLite cache (modernc.org/sqlite).
- Introduces `entire improve` (recurring friction detection + optional transcript deep-read + Claude CLI suggestions with unified diffs).
- Extracts shared utilities into new `llmcli` (Claude CLI runner) and `termstyle` (terminal styling) packages; adds evolve settings + basic evolve state helpers.
Reviewed changes
Copilot reviewed 40 out of 41 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| go.mod | Adds modernc.org/sqlite dependency (plus indirects). |
| go.sum | Updates module checksums for new dependencies. |
| cmd/entire/cli/termstyle/termstyle.go | New shared terminal styling helpers (color/width detection, rules, token formatting). |
| cmd/entire/cli/termstyle/termstyle_test.go | Unit tests for termstyle utilities. |
| cmd/entire/cli/summarize/claude.go | Refactors summarization Claude invocation to use shared llmcli. |
| cmd/entire/cli/llmcli/llmcli.go | New shared Claude CLI runner with git isolation + markdown JSON extraction. |
| cmd/entire/cli/llmcli/llmcli_test.go | Tests for runner defaults, error paths, git isolation, markdown extraction. |
| cmd/entire/cli/status_style.go | Switches existing status renderer to delegate styling/token helpers to termstyle. |
| cmd/entire/cli/strategy/manual_commit_types.go | Extends CondenseResult with optional SessionScore. |
| cmd/entire/cli/strategy/manual_commit_condensation.go | Computes session score during condensation and returns it in CondenseResult. |
| cmd/entire/cli/strategy/manual_commit_condensation_test.go | Adds tests for token/turn-count helper functions used in scoring. |
| cmd/entire/cli/insightsdb/db.go | New SQLite cache DB open/migrations + table listing for tests. |
| cmd/entire/cli/insightsdb/db_test.go | Tests for DB creation/migrations/idempotency. |
| cmd/entire/cli/insightsdb/cache.go | Cache insert/query helpers and schema mapping structs. |
| cmd/entire/cli/insightsdb/cache_test.go | Tests for cache meta + insert behavior + denormalized fields. |
| cmd/entire/cli/insightsdb/queries.go | Cache query methods (last N, by agent, recurring friction, etc.). |
| cmd/entire/cli/insightsdb/queries_test.go | Tests for query ordering/filtering/recurring friction behavior. |
| cmd/entire/cli/insights/insights.go | New insights domain types (scores, trends, report). |
| cmd/entire/cli/insights/scorer.go | New scoring algorithm + overall weighting. |
| cmd/entire/cli/insights/scorer_test.go | Unit tests for scoring functions. |
| cmd/entire/cli/insights/trends.go | Trend analysis + per-agent comparisons. |
| cmd/entire/cli/insights/trends_test.go | Tests for trends + agent comparisons. |
| cmd/entire/cli/insights_cmd.go | New entire insights command: cache refresh + renderers + score computation. |
| cmd/entire/cli/improve/improve.go | New improve domain types (context files, suggestions, pattern analysis). |
| cmd/entire/cli/improve/context_files.go | Detects known context files and reads content. |
| cmd/entire/cli/improve/context_files_test.go | Tests for context file detection behavior. |
| cmd/entire/cli/improve/analyzer.go | Builds repeated-friction themes + deduped learnings/open items from summaries. |
| cmd/entire/cli/improve/analyzer_test.go | Tests for analyzer theming/deduplication/threshold behavior. |
| cmd/entire/cli/improve/generator.go | Generates context-file suggestions via Claude CLI using llmcli. |
| cmd/entire/cli/improve/generator_test.go | Tests for generator parsing, defaults, IDs, timestamps, error handling. |
| cmd/entire/cli/improve_cmd.go | New entire improve command: friction query, deep-read excerpts, suggestion rendering. |
| cmd/entire/cli/evolve/evolve.go | New evolve state + suggestion record types. |
| cmd/entire/cli/evolve/trigger.go | Trigger logic for session-threshold evolution loop. |
| cmd/entire/cli/evolve/trigger_test.go | Tests for trigger logic/state updates. |
| cmd/entire/cli/evolve/notify.go | Emits user-facing “tip” to run entire improve when threshold reached. |
| cmd/entire/cli/evolve/notify_test.go | Tests for notification behavior. |
| cmd/entire/cli/evolve/tracker.go | In-memory tracker for suggestion lifecycle and simple impact measurement. |
| cmd/entire/cli/evolve/tracker_test.go | Tests for tracker operations. |
| cmd/entire/cli/settings/settings.go | Adds evolve settings + defaulting + JSON merge handling. |
| cmd/entire/cli/settings/settings_evolve_test.go | Tests for evolve settings defaulting/override behavior. |
| cmd/entire/cli/root.go | Registers new insights and improve commands. |
| // Split into first and second halves. | ||
| mid := len(scores) / 2 | ||
| firstHalf := scores[:mid] | ||
| secondHalf := scores[mid:] | ||
|
| firstAvg := average(firstHalf, extract) | ||
| secondAvg := average(secondHalf, extract) | ||
ComputeTrends assumes the input slice is ordered oldest→newest when splitting into halves. In entire insights, sessions are queried ORDER BY created_at DESC, so the newest sessions land in the first half and trend directions will be inverted. Consider sorting by CreatedAt ascending inside ComputeTrends (or explicitly reversing in the caller) before computing averages/data points.
| // buildPrompt constructs the prompt for the Claude CLI. | ||
| // All untrusted content (friction text, learnings, context file content) is wrapped | ||
| // in XML tags to prevent prompt injection. | ||
| func buildPrompt(analysis PatternAnalysis, contextFiles []ContextFile) string { | ||
| var sb strings.Builder | ||
|
| sb.WriteString(`Analyze recurring patterns from recent AI coding sessions and suggest | ||
| improvements to the project's context files. | ||
|
| `) | ||
|
| // Repeated friction section | ||
| sb.WriteString("<repeated_friction>\n") | ||
| if len(analysis.RepeatedFriction) == 0 { | ||
| sb.WriteString("(no repeated friction patterns found)\n") | ||
| } else { | ||
| for _, p := range analysis.RepeatedFriction { | ||
| fmt.Fprintf(&sb, "Theme: %s issues (occurred %d times)\n", p.Theme, p.Count) | ||
| for _, ex := range p.Examples { | ||
| fmt.Fprintf(&sb, " - %q\n", ex) | ||
| } | ||
| if p.TranscriptExcerpt != "" { | ||
| fmt.Fprintf(&sb, " Excerpt: %q\n", p.TranscriptExcerpt) | ||
| } | ||
| } |
The prompt-building comment says untrusted content is wrapped in XML tags to prevent prompt injection, but the inserted friction/learnings/context contents are not escaped. A friction string containing < / </repeated_friction> could break the structure and undermine the mitigation. Consider escaping/encoding untrusted strings (e.g., XML-escape or JSON-marshal values) and updating the comment to reflect the actual guarantees.
| suggestions := make([]Suggestion, 0, len(resp.Suggestions)) | ||
| for i, s := range resp.Suggestions { | ||
| suggestions = append(suggestions, Suggestion{ | ||
| ID: fmt.Sprintf("sug-%d-%d", now.Unix(), i), |
Suggestion IDs are based on now.Unix() (seconds) plus the loop index. Two entire improve runs within the same second can generate identical IDs (especially if they produce the same number of suggestions), which will collide with the suggestions.id primary key in SQLite. Consider using UnixNano, a UUID, or a random suffix for IDs.
Suggested change:
| - ID: fmt.Sprintf("sug-%d-%d", now.Unix(), i), |
| + ID: fmt.Sprintf("sug-%d-%d", now.UnixNano(), i), |
| // Build repeated friction list (threshold: 2+ occurrences) | ||
| var repeated []FrictionPattern | ||
| for theme, acc := range byTheme { | ||
| if acc.count < 2 { | ||
| continue | ||
| } | ||
| sessions := make([]string, 0, len(acc.sessions)) | ||
| for id := range acc.sessions { | ||
| sessions = append(sessions, id) | ||
| } | ||
| repeated = append(repeated, FrictionPattern{ | ||
| Theme: theme, | ||
| Count: acc.count, | ||
| Examples: acc.examples, | ||
| AffectedSessions: sessions, | ||
| }) | ||
| } |
AnalyzePatterns builds RepeatedFriction by iterating over a map, so the order of patterns (and therefore CLI output/prompt content) is nondeterministic across runs. This can lead to noisy diffs in generated suggestions and make behavior harder to reason about. Consider sorting repeated deterministically (e.g., by Count desc, then Theme asc) before returning.
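The ordering proposed here (Count descending, then Theme ascending) can be sketched with `sort.Slice`. `FrictionPattern` is reduced to the two fields that matter for ordering, and `sortPatterns` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// FrictionPattern is reduced here to the fields used for ordering.
type FrictionPattern struct {
	Theme string
	Count int
}

// sortPatterns orders patterns by Count descending, then Theme ascending.
// Themes come from map keys, so they are unique and the order is total.
func sortPatterns(p []FrictionPattern) {
	sort.Slice(p, func(i, j int) bool {
		if p[i].Count != p[j].Count {
			return p[i].Count > p[j].Count
		}
		return p[i].Theme < p[j].Theme
	})
}

func main() {
	byTheme := map[string]int{"test": 3, "lint": 3, "build": 5}
	repeated := make([]FrictionPattern, 0, len(byTheme))
	for theme, count := range byTheme { // map iteration order is random
		repeated = append(repeated, FrictionPattern{Theme: theme, Count: count})
	}
	sortPatterns(repeated)
	fmt.Println(repeated) // [{build 5} {lint 3} {test 3}]
}
```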
| row.InputTokens = meta.TokenUsage.InputTokens + meta.TokenUsage.CacheCreationTokens + meta.TokenUsage.CacheReadTokens | ||
| row.OutputTokens = meta.TokenUsage.OutputTokens | ||
| row.TotalTokens = row.InputTokens + row.OutputTokens |
metadataToSessionRow populates InputTokens with input+cache tokens but never sets CacheTokens, even though the DB schema and SessionRow struct have a dedicated cache_tokens column. This makes cached token breakdown inconsistent and prevents showing cache token usage accurately. Suggest setting InputTokens to just input tokens, CacheTokens to cache creation+read, and TotalTokens to input+cache+output.
Suggested change:
| - row.InputTokens = meta.TokenUsage.InputTokens + meta.TokenUsage.CacheCreationTokens + meta.TokenUsage.CacheReadTokens |
| - row.OutputTokens = meta.TokenUsage.OutputTokens |
| - row.TotalTokens = row.InputTokens + row.OutputTokens |
| + // Keep token categories distinct so cache usage can be reported accurately. |
| + row.InputTokens = meta.TokenUsage.InputTokens |
| + row.CacheTokens = meta.TokenUsage.CacheCreationTokens + meta.TokenUsage.CacheReadTokens |
| + row.OutputTokens = meta.TokenUsage.OutputTokens |
| + row.TotalTokens = row.InputTokens + row.CacheTokens + row.OutputTokens |
| // nullableFloat converts a zero float to a SQL NULL value. | ||
| // Non-zero floats are passed through as-is. | ||
| func nullableFloat(f float64) interface{} { | ||
| if f == 0 { | ||
| return nil | ||
| } |
nullableFloat converts 0.0 to SQL NULL. For score fields, 0 can be a valid computed value (e.g., severe friction), so this loses information in the DB and can change behavior for any future SQL aggregates/filters. Prefer inserting 0 as 0 and using an explicit “score computed” flag/column if you need to represent “unknown”.
Suggested change:
| - // nullableFloat converts a zero float to a SQL NULL value. |
| - // Non-zero floats are passed through as-is. |
| - func nullableFloat(f float64) interface{} { |
| - if f == 0 { |
| - return nil |
| - } |
| + // nullableFloat passes floats through as-is so that 0.0 is preserved as a valid value. |
| + func nullableFloat(f float64) interface{} { |
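If the schema still needs to represent "score not yet computed", one alternative to a separate flag column is to carry that state in the Go type. The sketch below assumes the score is held as a `*float64` (an assumption, not the PR's representation) and maps it to `sql.NullFloat64`, which `database/sql` accepts as a query argument. `nullableScore` is a hypothetical name.

```go
package main

import (
	"database/sql"
	"fmt"
)

// nullableScore maps an optional score to sql.NullFloat64:
// nil becomes SQL NULL, while any computed value, including 0.0, is kept.
func nullableScore(score *float64) sql.NullFloat64 {
	if score == nil {
		return sql.NullFloat64{}
	}
	return sql.NullFloat64{Float64: *score, Valid: true}
}

func main() {
	worst := 0.0
	fmt.Println(nullableScore(nil).Valid)    // false: stored as NULL
	fmt.Println(nullableScore(&worst).Valid) // true: stored as 0
}
```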
| // QuerySessionsWithFriction returns checkpoint IDs of sessions containing | ||
| // friction matching the given SQL LIKE pattern (e.g., "%tool call failed%"). | ||
| func (idb *InsightsDB) QuerySessionsWithFriction(ctx context.Context, pattern string) ([]string, error) { | ||
| rows, err := idb.db.QueryContext(ctx, | ||
| "SELECT DISTINCT checkpoint_id FROM friction WHERE text LIKE ?", | ||
| pattern, | ||
| ) |
QuerySessionsWithFriction returns only checkpoint IDs, dropping session_index even though the friction table is keyed by (checkpoint_id, session_index). Callers (e.g., transcript deep-read) can end up reading the wrong session (or always index 0) and miss the friction evidence. Consider returning (checkpoint_id, session_index) pairs (or a small struct) and adjusting callers accordingly.
| // Phase 2: Deep-read transcripts for top friction themes. | ||
| patterns := buildFrictionPatterns(frictionThemes) | ||
| if err = attachTranscriptExcerpts(ctx, idb, patterns, worktreeRoot); err != nil { | ||
| // Non-fatal: proceed without transcript excerpts. | ||
| _ = err | ||
| } | ||
|
| // Phase 3: Detect context files. | ||
| contextFiles := improve.DetectContextFiles(worktreeRoot) | ||
|
| // Phase 4: Build analysis from session data + friction patterns, then generate. | ||
| summaries := sessionRowsToSummaries(rows) | ||
| analysis := improve.AnalyzePatterns(summaries) | ||
| // Overlay the transcript excerpts we fetched into the analysis. | ||
| applyExcerpts(analysis.RepeatedFriction, patterns) |
Phase 2/4 excerpt wiring looks broken: patterns := buildFrictionPatterns(frictionThemes) sets Theme to the raw friction text, but AnalyzePatterns produces RepeatedFriction themes like "lint", "test", etc. applyExcerpts matches by Theme, so transcript excerpts will almost never attach to the analysis passed into Generator.Generate. Consider running attachTranscriptExcerpts directly on analysis.RepeatedFriction (or ensuring both phases use the same theme key).
Suggested change:
| - // Phase 2: Deep-read transcripts for top friction themes. |
| - patterns := buildFrictionPatterns(frictionThemes) |
| - if err = attachTranscriptExcerpts(ctx, idb, patterns, worktreeRoot); err != nil { |
| - // Non-fatal: proceed without transcript excerpts. |
| - _ = err |
| - } |
| - // Phase 3: Detect context files. |
| - contextFiles := improve.DetectContextFiles(worktreeRoot) |
| - // Phase 4: Build analysis from session data + friction patterns, then generate. |
| - summaries := sessionRowsToSummaries(rows) |
| - analysis := improve.AnalyzePatterns(summaries) |
| - // Overlay the transcript excerpts we fetched into the analysis. |
| - applyExcerpts(analysis.RepeatedFriction, patterns) |
| + // Phase 2: (deprecated wiring) Deep-read transcripts will be attached after analysis is built. |
| + // Phase 3: Detect context files. |
| + contextFiles := improve.DetectContextFiles(worktreeRoot) |
| + // Phase 4: Build analysis from session data, then attach transcript excerpts and generate. |
| + summaries := sessionRowsToSummaries(rows) |
| + analysis := improve.AnalyzePatterns(summaries) |
| + // Attach transcript excerpts directly to the repeated friction patterns used in generation. |
| + if err = attachTranscriptExcerpts(ctx, idb, analysis.RepeatedFriction, worktreeRoot); err != nil { |
| + // Non-fatal: proceed without transcript excerpts. |
| + _ = err |
| + } |
| // truncateString truncates s to at most maxLen bytes, appending "..." if truncated. | ||
| func truncateString(s string, maxLen int) string { | ||
| if len(s) <= maxLen { | ||
| return s | ||
| } | ||
| return s[:maxLen] + "..." | ||
| } |
truncateString slices by bytes (s[:maxLen]), which can cut multi-byte UTF-8 sequences (common in transcripts) and produce invalid text. Consider truncating by runes (or using utf8.ValidString / []rune), and ensure the ellipsis doesn’t exceed the requested limit if that matters.
| func (g *Generator) Generate(ctx context.Context, analysis PatternAnalysis, contextFiles []ContextFile) ([]Suggestion, error) { | ||
| prompt := buildPrompt(analysis, contextFiles) | ||
|
| raw, err := g.Runner.Execute(ctx, prompt) |
Generator.Generate assumes g.Runner is non-nil and will panic if a caller constructs Generator{}. Since Generator is exported, consider defaulting g.Runner to &llmcli.Runner{} when nil (or returning a clear error) to avoid panics in other packages/tests.
Suggested change:
| - raw, err := g.Runner.Execute(ctx, prompt) |
| + runner := g.Runner |
| + if runner == nil { |
| + runner = &llmcli.Runner{} |
| + } |
| + raw, err := runner.Execute(ctx, prompt) |


Summary
Adds three new features to help users improve their AI coding sessions based on collected session data:
- `entire insights`: Session quality scoring, cross-session trends, and agent comparisons. SQLite-cached, <1s response time.
- `entire improve`: Two-phase friction analysis (SQLite index → transcript deep-read → Claude CLI) that generates context file improvement suggestions (CLAUDE.md, AGENTS.md, .cursorrules, .gemini) with evidence and unified diffs.
Architecture
- `.entire/insights.db`: local analytics cache using `modernc.org/sqlite` (pure Go, CGO_ENABLED=0 compatible). Populated from the `entire/checkpoints/v1` branch with staleness detection.
- `insights/scores/` on the checkpoint branch for future frontend consumption.
- `llmcli` package: Common Claude CLI execution extracted from `summarize/claude.go`. Both summarize and improve use it with different prompts.
- `termstyle` package: Shared terminal styling extracted from `status_style.go` to avoid duplication across renderers.
New packages
- `cmd/entire/cli/termstyle/`
- `cmd/entire/cli/llmcli/`
- `cmd/entire/cli/insightsdb/`
- `cmd/entire/cli/insights/`
- `cmd/entire/cli/improve/`
- `cmd/entire/cli/evolve/`
Key decisions
- `IsSummarizeEnabled()`
- (`evolve.enabled: false` by default)
- `modernc.org/sqlite` (32MB → 40MB)
Test plan
- `mise run test:ci`: all unit + integration tests pass
- `entire insights` renders scores, trends, and agent comparisons
- `entire improve --dry-run` shows friction patterns without AI call
- `entire improve` generates context file suggestions
- `.entire/insights.db` is created and populated on first run
- (`CondenseResult.SessionScore`)
🤖 Generated with Claude Code
Note
Medium Risk
Adds new CLI commands plus a new SQLite cache and hooks into session condensation to compute/store quality scores, which could impact core session/metadata workflows and local disk state. Also introduces Claude CLI execution via a shared runner and transcript reads, increasing integration surface with external tooling.
Overview
Introduces a new analytics and improvement workflow:
`entire insights` computes per-session quality scores, trend metrics, and agent comparisons from a local SQLite cache, with both terminal and `--json` output.
Adds `entire improve` to analyze recent sessions for recurring friction (SQLite index + optional transcript deep-read) and then call the Claude CLI to generate context-file suggestions (with evidence and unified diffs), including a `--dry-run` mode that skips AI/transcript reads.
Adds an opt-in evolution loop (`settings.evolve`) to track sessions since the last improvement run and print a tip prompting users to run `entire improve` after a configurable threshold; also refactors shared infrastructure by extracting `llmcli` (Claude CLI runner + git isolation) and `termstyle` (shared lipgloss styling), and adds the `modernc.org/sqlite` dependency for the new cache.
Written by Cursor Bugbot for commit 7816386.