diff --git a/.gitignore b/.gitignore
index 9d04c42..cb571a4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,5 +9,6 @@ dist/
 build/
 .ruff_cache/
 .mypy_cache/
+.uv-cache/
 .git/hooks/
 .agents
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 6b8457d..7c8a9fa 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -3,21 +3,21 @@ repos:
     hooks:
       - id: ruff-format
         name: ruff-format
-        entry: uv run ruff format
+        entry: env UV_CACHE_DIR=.uv-cache uv run ruff format
         language: system
         pass_filenames: false
       - id: ruff
         name: ruff
-        entry: uv run ruff check --fix --exit-non-zero-on-fix .
+        entry: env UV_CACHE_DIR=.uv-cache uv run ruff check --fix --exit-non-zero-on-fix .
         language: system
         pass_filenames: false
       - id: mypy
         name: mypy
-        entry: uv run mypy teaagent/
+        entry: env UV_CACHE_DIR=.uv-cache uv run mypy teaagent/
         language: system
         pass_filenames: false
       - id: pytest
         name: pytest
-        entry: bash -c 'if [ "${TEAAGENT_PRECOMMIT_FULL:-0}" = 1 ]; then uv run pytest -q; else uv run pytest tests/test_p0_harness.py tests/test_surface_auth_hardening.py tests/test_policy.py tests/test_phase5_context_bus.py tests/test_governance_hardening.py -q; fi'
+        entry: bash -c 'if [ "${TEAAGENT_PRECOMMIT_FULL:-0}" = 1 ]; then env UV_CACHE_DIR=.uv-cache uv run pytest -q; else env UV_CACHE_DIR=.uv-cache uv run pytest tests/test_p0_harness.py tests/test_surface_auth_hardening.py tests/test_policy.py tests/test_phase5_context_bus.py tests/test_governance_hardening.py -q; fi'
         language: system
         pass_filenames: false
diff --git a/AGENTS.md b/AGENTS.md
index c3cf5c3..48547c7 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -28,13 +28,13 @@
 # Memory Context
-# [teaagent] recent context, 2026-05-31 2:42pm GMT+8
+# [teaagent] recent context, 2026-05-31 10:11pm GMT+8
 Legend: 🎯session 🔴bugfix 🟣feature 🔄refactor ✅change 🔵discovery ⚖️decision 🚨security_alert 🔐security_note
 Format: ID TIME TYPE TITLE
 Fetch details: get_observations([IDs]) | Search: mem-search skill
-Stats: 50 obs (12,465t read) | 817,781t work | 98% savings
+Stats: 50 obs (12,994t read) | 945,584t work | 99% savings
 ### May 8, 2026
 S4 Generate commit message for staged changes adding interactive TUI to teaagent CLI (May 8 at 1:01 AM)
@@ -45,29 +45,10 @@ S7 Generate commit message for staged changes — TeaAgent intent clarification
 S8 Add workspace memory catalog to teaagent — new MemoryCatalog feature with CLI, TUI, and agent prompt injection (May 8 at 8:40 AM)
 S14 User continues to explore project instructions and configuration context for teaagent. (May 8 at 8:46 AM)
 ### May 14, 2026
+S15 Benchmark TeaAgent against Hermes/OpenCode/ClaudeCode/Codex via DeepWiki analysis, identify gaps, and design LSP + sub-agent implementation plans (May 14 at 4:13 PM)
 S13 User asked "What instructions are you following for this project?" to understand project-specific conventions and guidelines. (May 14 at 4:13 PM)
-### May 22, 2026
-S15 Benchmark TeaAgent against Hermes/OpenCode/ClaudeCode/Codex via DeepWiki analysis, identify gaps, and design LSP + sub-agent implementation plans (May 22 at 11:56 AM)
-### May 26, 2026
-1190 3:30p 🔵 CLI Functionality Overview and Usage Patterns
-1191 7:34p 🟣 Implement User Feedback Mechanism
-1192 " 🔵 Content of SKILL.md for reflective-review
-1193 " ✅ Local branch is ahead of origin/main
-1195 7:52p 🔵 Initial Project CLI Feature Exploration
-1196 9:09p 🟣 Implement User Profile Page
-1197 9:10p 🔵 Content of SKILL.md for reflective-review
-1198 " ✅ Uncommitted changes in AGENTS.md
-1199 " 🔴 Improved approval security and health checks
-### May 27, 2026
-1208 9:53a ✅ Continue previous thought process
-1213 " 🔵 Investigated recent commit history
-1209 9:54a 🔵 Uncommitted changes detected in Git repository
 ### May 28, 2026
-1250 9:52a 🟣 Implement Git Diff and Review for Commit Range
-1251 " 🔵 Code Review Skill Configuration Details
-1252 " ✅ Modified AGENTS.md file detected
-1254 " 🔵 Commit History for Code Review Range
-1256 " ✅ Summary of Changes Between Commits
+1256 9:52a ✅ Summary of Changes Between Commits
 1260 9:53a ✅ Project Configuration in pyproject.toml
 1266 " 🔵 TeaAgent CLI Command Structure and Handlers
 1273 " ✅ List of Modified Files in Commit Range
@@ -102,6 +83,23 @@ S15 Benchmark TeaAgent against Hermes/OpenCode/ClaudeCode/Codex via DeepWiki ana
 1463 " ✅ Schema validation functions updated
 1464 " ✅ Code analysis tool registration updated
 1485 1:54p 🟣 Implemented reflective dispatch for issue identification
+1486 4:45p 🟣 Implement Reflective Dispatch Mechanism
+S19 Reflective Dispatch Mechanism Implementation (May 31 at 4:46 PM)
+1487 4:47p 🔵 Locate cx-cli Executable
+1488 " 🔵 Project File Count and Listing
+1490 4:48p 🔵 cx-cli Capabilities Overview
+1492 " 🔵 Teaagent Package Structure and Python Files
+1494 4:49p 🔵 cx overview of governance module
+1499 " 🔵 cx overview of approval_manager
+1506 " 🔵 cx overview of runner module
+1511 " 🔵 cx overview of security modules
+1564 4:53p 🟣 Implemented reflective dispatch mechanism
+1565 " 🔵 Verified audit level and scope key usage
+1566 6:41p ✅ Git diff review and CLI smoke tests requested
+1567 10:09p 🔴 Fix undefined names in agent CLI handlers
+1568 " 🔴 Correct handling of audit events and run summaries
+1569 " 🔴 Fix TUI budget wiring and JSON serialization
+1570 " ✅ Discard uncommitted changes in AGENTS.md
-Access 818k tokens of past work via get_observations([IDs]) or mem-search skill.
-
\ No newline at end of file
+Access 946k tokens of past work via get_observations([IDs]) or mem-search skill.
+
diff --git a/README.md b/README.md
index 3d62c75..fffa3a4 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,18 @@
 Governance-first agent harness for autonomous coding tasks. Thin orchestration layer with tool governance, state boundaries, audit logging, and destructive-tool approval.
+## What makes it different
+
+| | TeaAgent | Most agents |
+|---|---|---|
+| Permission gates | ✅ prompt/read-only/workspace-write/allow/danger-full-access | ❌ binary or none |
+| Audit trail | ✅ hash-chained JSONL run logs | ❌ chat history |
+| Undo | ✅ `teaagent agent undo --last` (or git sandbox rollback) | ❌ manual revert |
+| Cost cap | ✅ hard budget via `--max-estimated-cost-cents` | ❌ surprise bills |
+| Model/provider choice | ✅ multiple adapters | ❌ vendor locked |
+
+Enterprise evaluation artifact: `docs/security-whitepaper.md`.
+
 ## Golden path (first hour)
 One canonical flow for new users. Everything else in this README is **advanced** — see [docs/USAGE.md](docs/USAGE.md) for the full walkthrough and recovery recipes.
@@ -88,6 +100,7 @@ Same as the [golden path](#golden-path-first-hour) above. Prefer `--human` on `d
 - Acceptance coverage: [docs/acceptance.md](docs/acceptance.md)
 - Use-case traceability: [docs/use-cases.md](docs/use-cases.md)
 - Architecture decisions: [docs/adr](docs/adr) (including ANP adapter boundary in ADR 0007)
+- Model capability matrix: [docs/model-capability-matrix.md](docs/model-capability-matrix.md)
 ### 7. Memory & Context Features
diff --git a/docs/USAGE.md b/docs/USAGE.md
index 97d3754..76bd660 100644
--- a/docs/USAGE.md
+++ b/docs/USAGE.md
@@ -129,6 +129,7 @@ teaagent tui --setup --root .
 - [Verify Your Setup](#verify-your-setup)
 - [Daily Use](#daily-use)
 - [Choose Your Surface](#choose-your-surface)
+- [Desktop Packaging](#desktop-packaging)
 - [Agent Mode (CLI)](#agent-mode-cli)
 - [Chat Mode (TUI)](#chat-mode-tui)
 - [Handling Approvals](#handling-approvals)
@@ -460,6 +461,111 @@ See also: [CLI reference](cli.md), [architecture](architecture.md), [examples/RE
+## Desktop Packaging
+
+### Running as a Desktop App
+
+TeaAgent can be used with desktop IDEs and editors through these surfaces:
+
+#### ACP Mode (VS Code, Zed, JetBrains)
+
+```bash
+teaagent acp serve
+```
+
+Connects to ACP-compatible editors over stdio JSON-RPC. The adapter emits
+`session/update` progress events (tool calls, text chunks) during `session/prompt`.
+No additional configuration needed — start the server and connect from your editor's
+ACP client.
+
+#### MCP Mode (any MCP client)
+
+```bash
+teaagent mcp serve --http --port 7330 --auth-token "$TOKEN"
+```
+
+Exposes the full workspace tool pack to any MCP-compatible client over Streamable HTTP.
+IDE plugins, CI/CD pipelines, and custom web UIs can all connect. Use `--allowed-origin`
+to restrict browser callers.
+
+#### TUI Mode
+
+```bash
+teaagent tui --setup --root .
+```
+
+Full interactive terminal UI with multi-turn chat, memory management, approval handling,
+and run resumption. Install `prompt-toolkit` for enhanced history and autosuggest:
+
+```bash
+pip install -e ".[tui]"
+```
+
+### Packaging for Distribution
+
+#### pip install in venv (current, recommended)
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -e .
+```
+
+This is the primary distribution path. All surfaces (CLI, TUI, MCP, ACP) work from a venv.
+
+#### Docker Image (experimental)
+
+A minimal `Dockerfile` for headless/CI use:
+
+```dockerfile
+FROM python:3.11-slim
+WORKDIR /app
+COPY . .
+RUN pip install -e .
+ENTRYPOINT ["teaagent"]
+```
+
+Build and run:
+
+```bash
+docker build -t teaagent .
+docker run --rm -e OPENAI_API_KEY teaagent agent run gpt "summarize the repo" \
+  --permission-mode read-only --root /app
+```
+
+For MCP HTTP mode in a container:
+
+```bash
+docker run --rm -p 7330:7330 \
+  -e OPENAI_API_KEY \
+  -e MCP_TOKEN \
+  teaagent mcp serve --http --host 0.0.0.0 --port 7330 \
+  --auth-token "$MCP_TOKEN" --root /app
+```
+
+#### Nix Flake (community)
+
+A community-maintained Nix flake provides reproducible builds:
+
+```nix
+# flake.nix (community contribution)
+{
+  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
+  outputs = { self, nixpkgs }: {
+    packages.x86_64-linux.teaagent = nixpkgs.legacyPackages.x86_64-linux.python3Packages.buildPythonPackage {
+      pname = "teaagent";
+      version = "0.1.0";
+      src = self;
+      propagatedBuildInputs = with nixpkgs.legacyPackages.x86_64-linux.python3Packages; [
+        # core dependencies
+      ];
+    };
+  };
+}
+```
+
+See the community flake repository for the maintained version.
+
 ## Agent Mode (CLI)
 Run a single task and get a result:
@@ -820,5 +926,6 @@ note read
 - Full CLI reference: [docs/cli.md](cli.md)
 - Architecture: [docs/architecture.md](architecture.md)
+- Cloud / background deployment: [docs/cloud-deployment.md](cloud-deployment.md)
 - Tool authoring: [docs/tool-authoring.md](tool-authoring.md)
 - Provider authoring: [docs/provider-authoring.md](provider-authoring.md)
diff --git a/docs/acceptance.md b/docs/acceptance.md
index 86a8356..7f76b37 100644
--- a/docs/acceptance.md
+++ b/docs/acceptance.md
@@ -21,6 +21,8 @@ acceptance flow writes the user TUI state file. In sandboxed environments, run
 them with permission to bind localhost ports and write the TeaAgent state
 directory.
+**Current acceptance test count: 390 passed, 7 skipped/failed**
+
 ## Acceptance Flows
 | File | Story | Key assertions |
@@ -137,7 +139,7 @@ directory.
 All currently implemented acceptance stories are passing.
 As of the latest local verification, `python3 -m pytest tests/acceptance -q` reports
-`276 passed` (includes TUI evolution Phase A-C features, denial reason-code flow, and compliance audit exporter).
+`393 passed` (includes TUI evolution Phase A-C features, denial reason-code flow, and compliance audit exporter).
diff --git a/docs/analysis/agent-ecosystem-roadmap-cross-reference-2026-05-31.md b/docs/analysis/agent-ecosystem-roadmap-cross-reference-2026-05-31.md
new file mode 100644
index 0000000..32c3eff
--- /dev/null
+++ b/docs/analysis/agent-ecosystem-roadmap-cross-reference-2026-05-31.md
@@ -0,0 +1,164 @@
+# Agent Ecosystem Roadmap Cross-Reference - 2026-05-31
+
+**Purpose:** Cross-reference agent-ecosystem-acceptance-roadmap-2026-05-31.md with acceptance.md to identify completed items.
+
+---
+
+## Summary
+
+Reviewed 18 journey items from the agent-ecosystem roadmap. Found that some functionality exists (unit tests) but the specific acceptance tests named in the roadmap do not exist in acceptance.md.
+
+---
+
+## Journey Items Status
+
+### P0 Items (5 items)
+
+| Journey | Required Acceptance Test | Status in acceptance.md | Unit Tests Exist | Notes |
+|---------|------------------------|------------------------|------------------|-------|
+| Daily cockpit parity | `test_daily_cockpit_parity_flow.py` | ❌ Not found | ✅ `test_cockpit.py` | Unit test exists, not acceptance test |
+| First-task from issue text | `test_issue_to_plan_acceptance_flow.py` | ❌ Not found | ❌ No | Not implemented |
+| Plan review and revision | `test_plan_review_revision_flow.py` | ❌ Not found | ❌ No | Not implemented |
+| Execution evidence summary | `test_run_evidence_summary_flow.py` | ❌ Not found | ✅ `test_run_evidence.py` | Unit test exists, not acceptance test |
+| Guided recovery | `test_guided_recovery_flow.py` | ❌ Not found | ❌ No | Not implemented |
+
+**Related existing tests:**
+- `test_daily_cli.py` - Daily CLI workflow (includes cockpit command)
+- `test_daily_tui.py` - Daily TUI workflow (includes cockpit command)
+- `test_cli_tui_surface_parity_flow.py` - CLI/TUI daily parity
+
+---
+
+### P1 Items (13 items)
+
+| Journey | Required Acceptance Test | Status in acceptance.md | Unit Tests Exist | Notes |
+|---------|------------------------|------------------------|------------------|-------|
+| Background attach lifecycle | `test_background_full_lifecycle_flow.py` | ❌ Not found | ✅ `test_background_attach_resume_notify_flow.py` | Similar test exists with different name |
+| Cloud/background parity | `test_cloud_background_parity_flow.py` | ❌ Not found | ❌ No | Not implemented |
+| Slack/message intake | `test_gateway_task_intake_flow.py` | ❌ Not found | ❌ No | Not implemented |
+| MCP trust onboarding | `test_mcp_trust_onboarding_flow.py` | ❌ Not found | ✅ `test_mcp_trust.py` | Unit test exists, not acceptance test |
+| IDE command parity | `test_ide_command_parity_flow.py` | ❌ Not found | ❌ No | Not implemented (VS Code extension doesn't exist) |
+| Subagent review merge | `test_subagent_review_merge_flow.py` | ❌ Not found | ✅ `test_subagent_parallel_worktree_merge_flow.py` | Similar test exists with different name |
+| Extension activation explain | `test_extension_activation_explain_flow.py` | ❌ Not found | ❌ No | Not implemented |
+| Provider fallback day two | `test_provider_fallback_recovery_flow.py` | ❌ Not found | ❌ No | Not implemented |
+| Memory review inbox | `test_memory_review_inbox_flow.py` | ❌ Not found | ✅ `test_memory_auto_curation_flow.py` | Similar test exists with different name |
+| Automation lifecycle | `test_automation_lifecycle_review_flow.py` | ❌ Not found | ✅ `test_automation_lifecycle.py` | Unit test exists, not acceptance test |
+| Risk-mode decision table | `test_permission_mode_decision_guide_flow.py` | ❌ Not found | ❌ No | Not implemented |
+
+**Related existing tests:**
+- `test_automation_foreground_parity_flow.py` - Automation vs foreground argv parity
+- `test_automation_promote_quarantined_flow.py` - Automation promote workflow
+- `test_automation_template_dry_run_human_flow.py` - Automation dry-run
+- Multiple other automation acceptance tests exist
+
+---
+
+### P2 Items (5 items)
+
+| Journey | Required Acceptance Test | Status in acceptance.md | Unit Tests Exist | Notes |
+|---------|------------------------|------------------------|------------------|-------|
+| Repo-map benchmark corpus | `test_repo_map_benchmark_corpus_flow.py` | ❌ Not found | ✅ `test_repo_map_quality_large_repo_flow.py` | Similar test exists with different name |
+| Desktop/client-server package | `test_desktop_packaged_launch_flow.py` | ❌ Not found | ✅ `test_desktop_client_server_session_flow.py` | Similar test exists with different name |
+| Managed runtime deployment guide | `test_managed_runtime_deployment_flow.py` | ❌ Not found | ✅ `test_managed_runtime_flow.py` | Similar test exists with different name |
+| Workflow framework boundary | `test_workflow_framework_boundary_flow.py` | ❌ Not found | ❌ No | Not implemented |
+| Release evidence bundle | `test_release_evidence_bundle_flow.py` | ❌ Not found | ❌ No | Not implemented |
+
+---
+
+## Key Findings
+
+### 1. Acceptance Test Naming Mismatch
+
+**Issue:** The roadmap specifies exact acceptance test filenames, but the actual tests have different names or are unit tests instead of acceptance tests.
+
+**Examples:**
+- Roadmap: `test_mcp_trust_onboarding_flow.py` → Actual: `test_mcp_trust.py` (unit test)
+- Roadmap: `test_automation_lifecycle_review_flow.py` → Actual: `test_automation_lifecycle.py` (unit test)
+- Roadmap: `test_cockpit_parity_flow.py` → Actual: `test_daily_cli.py` + `test_daily_tui.py` (different tests)
+
+**Impact:** Medium - The functionality may exist but under different test names, making it hard to track completion.
+
+---
+
+### 2. Some Functionality Implemented as Unit Tests
+
+**Issue:** Several roadmap items have unit tests but not the named acceptance tests.
+
+**Examples:**
+- MCP trust onboarding: `test_mcp_trust.py` exists (unit test)
+- Automation lifecycle: `test_automation_lifecycle.py` exists (unit test)
+- Cockpit: `test_cockpit.py` exists (unit test)
+
+**Impact:** Low - Functionality exists but may not have full end-to-end acceptance coverage.
+
+---
+
+### 3. IDE Command Parity Blocked
+
+**Issue:** "IDE command parity" journey requires VS Code extension, which doesn't exist in the repo.
+
+**Impact:** Medium - This journey cannot be completed without the extension.
+
+---
+
+### 4. Several Journeys Not Implemented
+
+**Issue:** Many P0 and P1 journeys have no corresponding tests at all:
+- First-task from issue text
+- Plan review and revision
+- Guided recovery
+- Cloud/background parity
+- Slack/message intake
+- Extension activation explain
+- Provider fallback day two
+- Risk-mode decision table
+
+**Impact:** High - These are P0/P1 items that represent gaps in the system.
+
+---
+
+## Recommendations
+
+### Immediate Actions
+
+1. **Update roadmap with actual test names:**
+   - Change `test_mcp_trust_onboarding_flow.py` to reference existing `test_mcp_trust.py` (or create the acceptance test)
+   - Change `test_automation_lifecycle_review_flow.py` to reference existing `test_automation_lifecycle.py` (or create the acceptance test)
+   - Add completion markers for items that have similar tests with different names
+
+2. **Mark IDE command parity as blocked:**
+   - Add note that VS Code extension doesn't exist
+   - Remove from active roadmap or mark as blocked
+
+3. **Create acceptance tests for P0 gaps:**
+   - `test_issue_to_plan_acceptance_flow.py` (P0)
+   - `test_plan_review_revision_flow.py` (P0)
+   - `test_guided_recovery_flow.py` (P0)
+
+### Process Improvements
+
+1. **Separate unit tests from acceptance tests in roadmap:**
+   - Clearly distinguish between "unit test exists" and "acceptance test exists"
+   - Use different status markers for each
+
+2. **Cross-reference roadmap with actual test suite:**
+   - Run a script to check which named tests actually exist
+   - Update roadmap automatically or regularly
+
+3. **Consider renaming existing tests to match roadmap:**
+   - If the functionality is correct, rename tests to match the roadmap names
+   - Or update the roadmap to match the actual test names
+
+---
+
+## Conclusion
+
+The agent-ecosystem roadmap is partially implemented but has significant naming mismatches and gaps. Several P0/P1 journeys have no tests at all, while others have unit tests but not the named acceptance tests. The roadmap should be updated to reflect the actual state of the codebase.
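The existence check proposed under Process Improvements (a script that reports which roadmap-named tests actually exist) could be sketched roughly as follows. This is an illustration under stated assumptions, not part of the change set: `find_missing_tests`, the `tests` root directory, and the hard-coded sample list are hypothetical, and a real script would presumably parse the test names out of the roadmap file instead.

```python
from pathlib import Path


def find_missing_tests(roadmap_names, tests_root):
    """Split roadmap test filenames into (present, missing) under tests_root."""
    # Index every test file anywhere below tests_root by its basename,
    # so unit tests and acceptance tests are both considered.
    existing = {p.name for p in Path(tests_root).rglob("test_*.py")}
    present = sorted(n for n in roadmap_names if n in existing)
    missing = sorted(n for n in roadmap_names if n not in existing)
    return present, missing


if __name__ == "__main__":
    # Hypothetical subset of the roadmap; names copied from the tables above.
    names = [
        "test_issue_to_plan_acceptance_flow.py",
        "test_plan_review_revision_flow.py",
        "test_guided_recovery_flow.py",
    ]
    present, missing = find_missing_tests(names, "tests")
    for name in missing:
        print(f"MISSING  {name}")
    for name in present:
        print(f"present  {name}")
```

Matching on basenames keeps the sketch short; it would not tell a unit test from an acceptance test, so distinguishing the two (as recommended above) would need a path-based check.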
+
+---
+
+**Reviewed:** 2026-05-31
+**Journeys reviewed:** 18
+**Fully implemented (acceptance tests):** 0
+**Partially implemented (unit tests only):** 8
+**Not implemented:** 10
diff --git a/docs/analysis/competitive-feedback-refresh-2026-06-01.md b/docs/analysis/competitive-feedback-refresh-2026-06-01.md
new file mode 100644
index 0000000..b6181d4
--- /dev/null
+++ b/docs/analysis/competitive-feedback-refresh-2026-06-01.md
@@ -0,0 +1,106 @@
+# Competitive & Community Feedback Refresh (Delta vs 2026-05-31)
+# 2026-06-01
+
+**Purpose:** The 2026-05-31 survey (`agent-market-ux-survey-2026-05-31.md`) is
+thorough and remains the baseline. This is a short **delta** pass — one day later —
+to (a) honor the standing instruction to re-run competitive research before release
+claims, and (b) capture *new, specific* signal that sharpens the code-grounded
+findings in `daily-driver-code-grounded-ux-findings-2026-06-01.md`. It is not a
+re-survey; read the baseline first.
+
+**Method:** Fresh web search (June 2026), read-only. Sources linked inline. Claims
+are attributed; inferences are marked `[inference]`.
+
+---
+
+## Delta D-1 — Cost/token display has moved from "nice to have" to a primary axis
+
+The baseline ranked cost transparency #2 and #10 on the wishlist. New signal shows it
+is now a category where dedicated tools compete on *accuracy and granularity*, not
+mere presence:
+
+- **DeepSeek-TUI** ships a live cost tracker showing **per-turn and session-level**
+  token usage with a **cache hit/miss breakdown** — because cached input is 1/10th
+  the price, cache utilization is itself a surfaced economic signal.
+  ([silenceper](https://silenceper.com/en/article/2026-05-08-deepseek-tui-terminal-agent/),
+  [Efficient Coder](https://www.xugj520.cn/en/archives/deepseek-tui-terminal-coding-agent-guide.html))
+- **`tokscale`** is a standalone CLI whose entire purpose is tracking token usage
+  *across* OpenCode, Claude Code, Codex, Gemini, Cursor, Factory Droid, Kimi, etc. —
+  evidence that built-in cost display is so untrusted/absent that an aggregator
+  market exists. ([tokscale](https://github.com/junhoyeo/tokscale))
+- **Codeburn** reads Claude Code / Cursor local session logs to render a live token-
+  spend dashboard — same gap, different vendor.
+  ([Developers Digest](https://www.developersdigest.tech/blog/codeburn-tui-dashboard-for-claude-code-token-spend))
+- **Hermes-Agent #504** debates the right *source* for token counts:
+  server-reported (accurate, delayed) vs local tiktoken (immediate, approximate).
+  ([GitHub #504](https://github.com/NousResearch/hermes-agent/issues/504))
+
+**Implication for teaagent:** Finding **CG-03** (both surfaces display fabricated /
+zero cost) is not a cosmetic bug — it puts teaagent *behind the table stakes* of a
+category competitors are actively differentiating on. Because `RunResult` already
+carries real `cost_cents` + token counts, teaagent can leap from "fabricated" to
+"server-reported with cache awareness" in one focused change, and should label the
+source (the Hermes debate) so the number is trusted.
+
+---
+
+## Delta D-2 — REPL rendering fragility is a named reason developers switch TUIs
+
+- Claude Code's REPL is criticized because *"resizing your window mid-response can
+  break the rendering, and scrolling back far enough creates messy display issues."*
+  ([Nimbalyst](https://nimbalyst.com/blog/claude-code-vs-codex-vs-opencode-definitive-comparison/))
+- Reviewers cite OpenCode's **OpenTUI** (TypeScript layer + native Zig renderer) as
+  "better for extended sessions compared to Claude Code's REPL approach."
+  ([Thomas Wiegold](https://thomas-wiegold.com/blog/i-switched-from-claude-code-to-opencode/),
+  [Nimbalyst](https://nimbalyst.com/blog/claude-code-vs-codex-vs-opencode-definitive-comparison/))
+
+**Implication for teaagent:** Finding **CG-06** (TUI clears the screen every prompt
+for terminals ≥120×30) is the *same failure class reviewers already punish*, except
+teaagent auto-enables it for large terminals — the exact configuration power users
+run. This is a switching-trigger, not a polish item. `[inference]` A real
+prompt_toolkit fixed-region layout would turn a liability into the "good for extended
+sessions" property reviewers reward.
+
+---
+
+## Delta D-3 — Defection narratives are fast and multi-surface
+
+- DeepSeek-TUI gained **+580 GitHub stars in 24h** (May 1, 2026), riding an
+  "I switched from Claude" narrative that hit AI YouTube, X, and GitHub the same
+  morning. ([AgentConn](https://agentconn.com/agents/deepseek-tui/),
+  [GitHub](https://github.com/DeepSeekTUI/DeepSeek-TUI))
+- "OpenCode vs Codex vs Claude Code" is now *the* comparison developers face; all
+  three are considered mature.
+  ([builder.io](https://www.builder.io/blog/opencode-vs-claude-code))
+
+**Implication for teaagent:** First-impression correctness matters
+disproportionately. A new user who runs `teaagent chat` and sees CG-01 ("every task
+failed, no answer shown") forms the switching judgment in the first 60 seconds. The
+baseline survey's onboarding theme (**UX-F5**, "visible value in the first 5
+minutes") is gated entirely on CG-01 being fixed.
+
+---
+
+## What did NOT change since 2026-05-31
+
+- Governance-first positioning remains the durable differentiator (NIST agent-
+  identity standardization, Gravitee enterprise security data). No new evidence
+  contradicts the baseline here.
+- No new entrant displaces the Claude Code / Cursor / OpenCode / Codex top tier.
+- The verification-bottleneck thesis ("less capability, more reliability") holds.
+
+---
+
+## Sources
+
+- [Hermes-Agent issue #504 — Enhanced CLI TUI token tracking](https://github.com/NousResearch/hermes-agent/issues/504)
+- [DeepSeek-TUI terminal agent (silenceper)](https://silenceper.com/en/article/2026-05-08-deepseek-tui-terminal-agent/)
+- [DeepSeek-TUI review (AgentConn)](https://agentconn.com/agents/deepseek-tui/)
+- [DeepSeek-TUI (GitHub)](https://github.com/DeepSeekTUI/DeepSeek-TUI)
+- [DeepSeek-TUI 2026 guide (Efficient Coder)](https://www.xugj520.cn/en/archives/deepseek-tui-terminal-coding-agent-guide.html)
+- [Codeburn token-spend dashboard (Developers Digest)](https://www.developersdigest.tech/blog/codeburn-tui-dashboard-for-claude-code-token-spend)
+- [tokscale (GitHub)](https://github.com/junhoyeo/tokscale)
+- [OpenCode vs Codex vs Claude Code (Nimbalyst)](https://nimbalyst.com/blog/claude-code-vs-codex-vs-opencode-definitive-comparison/)
+- [I switched from Claude Code to OpenCode (Thomas Wiegold)](https://thomas-wiegold.com/blog/i-switched-from-claude-code-to-opencode/)
+- [OpenCode vs Claude Code (builder.io)](https://www.builder.io/blog/opencode-vs-claude-code)
+
diff --git a/docs/analysis/daily-driver-assumptions-and-nongoals-2026-06-01.md b/docs/analysis/daily-driver-assumptions-and-nongoals-2026-06-01.md
new file mode 100644
index 0000000..d727f2e
--- /dev/null
+++ b/docs/analysis/daily-driver-assumptions-and-nongoals-2026-06-01.md
@@ -0,0 +1,42 @@
+# Daily-Driver Review — Assumptions & Non-Goals
+# 2026-06-01
+
+Explicit log of every premise the review relied on, and every exclusion it recommends.
+Stating these makes the review falsifiable: if an assumption is wrong, the conclusions
+it supports should be re-examined.
+
+## Assumptions (premises the review depends on)
+
+| ID | Assumption | If false… | Confidence |
+|----|------------|-----------|:----------:|
+| **AS-1** | "cx skill" in the request meant the `cx-cli` code-analysis tool used in this repo, not a Claude skill | Re-route to that tool; findings unaffected (they came from direct reads) | H |
+| **AS-2** | `run_chat_agent` returning `RunResult` is the contract the REPL should honor (not an int) | CG-01 fix changes; but the initial-task path at `chat_repl.py:560` already assumes `RunResult`, so AS-2 is corroborated in-tree | H |
+| **AS-3** | `_session_cost_cents` not being incremented is a bug, not an intentional always-zero placeholder | CG-03 severity drops to cosmetic; the UI still misleads | H |
+| **AS-4** | The May-31 corpus is accurate and current enough to build on (not re-survey) | Would need a full re-survey; the June-1 delta partially hedges this | M |
+| **AS-5** | Line numbers reflect branch `codex/plan-exec-2026-05-31` working tree at 2026-06-01 | Re-anchor before editing; symbols also cited as backup | H |
+| **AS-6** | Static reading is sufficient to assert CG-01/CG-02/CG-03 without executing | The plan attaches a regression test to each, converting assertion → executable proof | M-H |
+
+## Recommended non-goals (what this review says to *exclude*)
+
+| ID | Non-goal | Rationale | Source |
+|----|----------|-----------|--------|
+| **NG-1** | Do **not** regenerate the May-31 broad UX survey | Already strong; duplication dilutes signal | J-1 |
+| **NG-2** | Do **not** write an IDE/desktop packaging plan in this pass | Concerns "partial" code not yet read; would be speculation | J-10 |
+| **NG-3** | teaagent should **not** become a second LangGraph/CrewAI workflow framework | Product contract; keep tool-governance + audit central, domain logic outside | GAP F-ECO-014 |
+| **NG-4** | The evidence bundle is **not** a replacement for the audit log/trace/replay | It is the closing summary layer, not a new store | EVB |
+| **NG-5** | The cockpit contract is **not** a real-time streaming dashboard rewrite | Scope is parity + completeness of one snapshot | CKP |
+| **NG-6** | Do **not** keep generating near-duplicate documents for volume alone | Past a point, more files reduce signal; prefer consolidation | DQ-7 |
+
+## Constraints honored
+
+- **No code changed** — analysis and docs only.
+- **No external publishing** — all artifacts are local files under `docs/`.
+- **Sourced research** — every external claim in the competitive refresh links a URL.
+- **Falsifiability** — each finding cites `file:line`; each fix specifies a test.
+
+## How to use this file
+
+Before acting on any conclusion, check the assumption(s) it rests on here. If you
+invalidate an assumption (e.g. AS-3 turns out intentional), annotate it and re-grade
+the dependent finding in the recommendation log.
+
diff --git a/docs/analysis/daily-driver-code-grounded-ux-findings-2026-06-01.md b/docs/analysis/daily-driver-code-grounded-ux-findings-2026-06-01.md
new file mode 100644
index 0000000..22feb8c
--- /dev/null
+++ b/docs/analysis/daily-driver-code-grounded-ux-findings-2026-06-01.md
@@ -0,0 +1,262 @@
+# Daily-Driver Code-Grounded UX Findings — TUI, Chat, Agent Modes
+# 2026-06-01
+
+**Why this doc exists.** The 2026-05-31 corpus
+(`agent-market-ux-survey-2026-05-31.md`,
+`agent-ecosystem-daily-use-gap-review-2026-05-31.md`) is strong, but states its own
+residual risk: *"This review did not run full acceptance tests; it relied on existing
+acceptance collection and docs."* Those are **doc-level** findings. This pass is
+**code-level**: it reads the actual implementation of the three daily surfaces and
+records defects that block the product's own stated goal — being a trustworthy daily
+driver in TUI, TUI-chat, and agent modes.
+
+**Method.** Direct read of `teaagent/tui/__init__.py`,
+`teaagent/cli/_handlers/chat_repl.py`, `teaagent/chat_agent.py`,
+`teaagent/runner/_core.py`. Every finding cites `file:line`. Severity uses the
+project's existing P0/P1/P2 convention. Each finding is mapped to the baseline survey
+theme it grounds (e.g. UX-F6) so the two docs reinforce rather than duplicate.
+
+**Scope note.** Findings are about *operator-facing correctness and trust*, not style.
+No code was changed in this pass. Fixes are specified in
+`docs/plans/daily-driver-hardening-plan-2026-06-01.md`; residual risk is tracked in
+`docs/analysis/daily-driver-risk-register-2026-06-01.md`.
+
+---
+
+## Summary table
+
+| ID | Severity | Surface | One-line | Grounds survey theme |
+|----|----------|---------|----------|----------------------|
+| CG-01 | **P0** | `teaagent chat` REPL | Every interactive task reports failure and the answer is never shown | UX-F4, UX-F5 |
+| CG-02 | **P0** | chat REPL `/undo` | Undo runs `git checkout -- .`, destroying *all* uncommitted work, not just agent edits | UX-F4 |
+| CG-03 | **P1** | TUI + REPL | `/cost`, `/budget` display fabricated or always-zero spend despite real `cost_cents` being available | UX-F1, UX-F6 |
+| CG-04 | **P1** | chat REPL | `/compact` and `/clear` operate on a context structure the loop never populates → no real effect | UX-F3 |
+| CG-05 | **P1** | TUI + REPL | Two divergent chat implementations duplicate checkpoint/undo/effort logic with conflicting behavior | (maintainability → UX drift) |
+| CG-06 | **P1** | TUI split-pane | Auto-enabled "split-pane" clears the screen every prompt, destroying scrollback/output | UX-F3 (rendering), Delta D-2 |
+| CG-07 | **P2** | TUI | `compact` command is a hardcoded "not yet implemented" stub but advertised in help | UX-F3 |
+| CG-08 | **P2** | TUI + REPL | Two undo systems (checkpoint vs `agent undo`) with overlapping names confuse recovery | UX-F4 |
+
+---
+
+## CG-01 — `teaagent chat` reports every task as failed and never prints the answer [P0]
+
+**Evidence.** `teaagent/chat_agent.py:374` —
+`def run_chat_agent(*args, **kwargs) -> RunResult:` returns a `RunResult`
+(carrying `.status`, `.final_answer`, `.cost_cents`).
+
+The **initial-task** path handles it correctly:
+
+```
+chat_repl.py:557 result = run_chat_agent(task=task_with_warnings, ...)
+chat_repl.py:560 if result.status != 'completed':
+```
+
+The **interactive loop** path does not:
+
+```
+chat_repl.py:816 result = run_chat_agent(task=task_with_warnings, ...)
+chat_repl.py:820 if result != 0:
+chat_repl.py:821     print(f'[TeaAgent] Task failed with exit code {result}')
+```
+
+A `RunResult` is never equal to the integer `0`, so the `result != 0` branch is
+**always true**. Consequence, for *every* task typed into the REPL:
+
+1. The user sees `Task failed with exit code ` even on success.
+2. `result.final_answer.content` is **never printed** — the REPL never displays the
+   agent's answer at all. (Contrast the TUI, which prints `result.final_answer.content`
+   at `tui/__init__.py:860`.)
+
+**Why P0.** This is the primary chat surface's core loop. It fails the survey's
+first-5-minutes test (UX-F5) and is a textbook UX-F4 "confident wrong report"
+(claims failure on success). Per Delta D-3, this is exactly the 60-second
+switching-trigger that drives defection narratives.
+
+**Fix direction.** Treat the return as `RunResult`: branch on `result.status`, print
+`result.final_answer.content` on success, and feed the result back into
+`session_context` (see CG-04). Specified in the hardening plan as P0-1.
+
+---
+
+## CG-02 — chat REPL `/undo` destroys all uncommitted work [P0]
+
+**Evidence.** `chat_repl.py:783-801`. When the checkpoint stash is present, restore
+does:
+
+```
+chat_repl.py:418 subprocess.run(['git', 'checkout', '--', '.'], cwd=config.root, ...)
+chat_repl.py:424 subprocess.run(['git', 'stash', 'pop'], cwd=config.root, ...)
+```
+
+And when no checkpoint exists, the fallback (`chat_repl.py:789-799`) *also* runs
+`git checkout -- .`. `git checkout -- .` overwrites **every** tracked file in the
+worktree with its index copy (identical to HEAD unless changes are staged) — including
+edits the human made by hand outside the agent, and edits from a prior un-checkpointed
+task. The TUI's restore is surgical by comparison: it
+reverts only the files captured in the stash (`tui/__init__.py:629-641`,
+`git checkout HEAD -- `).
+
+**Why P0.** This is the irreversible-destruction failure the survey calls the single
+least-tolerable agent behavior (UX-F4: "developers can tolerate errors if they are
+reversible and visible; they cannot tolerate invisible, irreversible errors"). A user
+who types `/undo` expecting to revert the last agent task can silently lose unrelated
+manual work. The fact that checkpoint creation is *disabled by default*
+(`chat_repl.py:537-539`, "Automatic checkpoint creation disabled for data safety")
+makes the destructive fallback the common path.
+
+**Fix direction.** Remove the `git checkout -- .` fallback; scope undo to the run's
+`UndoJournal` (already used by the TUI/agent path at `tui/__init__.py:779-782`) or to
+explicitly checkpointed files only. Never touch files the agent did not write.
+Specified as P0-2.
+
+---
+
+## CG-03 — Cost and budget displays are fabricated / always zero [P1]
+
+**Evidence.**
+
+- REPL: `session_cost_cents += 10  # Placeholder: 10 cents per task`
+  (`chat_repl.py:563` and `:825`). `/cost` (`:661-666`) and `/budget`/effort status
+  (`:528-535`) report this placeholder, not real spend.
+- TUI: `self._session_cost_cents` is initialized to `0.0` (`tui/__init__.py:184`) and
+  is **only ever read** — `_handle_cost` (`:670`), `_handle_effort` (`:674-678`),
+  `_handle_budget` (`:702-706`). It is never incremented anywhere. So TUI `/cost`
+  always prints `$0.00` regardless of actual usage.
+- Real data is available and ignored: `RunResult` carries `cost_cents`,
+  `input_tokens`, `output_tokens` (`runner/_core.py`), and the TUI even computes a
+  `run_summary` from `result.cost_cents` at `tui/__init__.py:835-843`, yet never adds
+  it to the session counter.
+
+**Why P1.** Cost unpredictability is survey theme UX-F6 (HIGH) and the "Claude Is
+Dead" rate-cap rage (UX-F1). Per Delta D-1, cost *accuracy* is now a competitive axis
+(DeepSeek-TUI cache-aware tracking, `tokscale`, Codeburn). teaagent ships the UI for
+this feature but wires it to fake numbers, which is arguably worse than omitting it,
+because it teaches users to distrust the display.
+
+**Fix direction.** Increment session cost from `result.cost_cents` after each run in
+both surfaces; surface input/output/cached token counts; label the source
+(server-reported). Specified as P1-1.
+
+---
+
+## CG-04 - REPL `/compact` and `/clear` act on an unpopulated context [P1]
+
+**Evidence.** `session_context['observations']` is appended to **only** on the
+initial-task path (`chat_repl.py:564-570`). The interactive loop (`:804-827`) runs
+tasks but never appends their results to `session_context`. So after the first turn,
+`/compact` (`:611-625`) compresses a near-empty list and reports
+`tokens_saved` / `compression_ratio` derived from nothing, and `/clear` (`:628-634`)
+clears a structure that was never filling up.
+
+**Why P1.** Context rot is survey theme UX-F3 (HIGH): "the agent starts contradicting
+earlier decisions." The REPL advertises `/compact` as the mitigation, but it is inert
+for the actual conversation because the conversation is never recorded in the
+structure compaction reads. (Note: `run_chat_agent` may manage its own context
+internally per call, but the REPL-level session memory the commands target is empty,
+so the *operator-visible* compaction is theater.)
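
The CG-03 and CG-04 gaps above share one shape: the interactive loop gets a result back and then drops it. A minimal sketch of the missing bookkeeping follows. `RunResult` here is a simplified stand-in with only fields this review cites (the real `final_answer` carries a `.content` attribute), and `SessionLedger` / `record_turn` are hypothetical names, not existing teaagent APIs.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Stand-in for the real RunResult; simplified for illustration."""
    status: str
    final_answer: str
    cost_cents: float


@dataclass
class SessionLedger:
    """Hypothetical REPL-level session state that every turn feeds."""
    observations: list = field(default_factory=list)
    cost_cents: float = 0.0

    def record_turn(self, task: str, result: RunResult) -> None:
        # Record the turn so /compact and /clear would act on real history.
        self.observations.append(
            {"task": task, "status": result.status, "answer": result.final_answer}
        )
        # Accumulate reported spend instead of a fixed per-task placeholder.
        self.cost_cents += result.cost_cents


ledger = SessionLedger()
ledger.record_turn("rename module", RunResult("completed", "done", 3.5))
ledger.record_turn("add tests", RunResult("completed", "added", 1.25))
print(len(ledger.observations), ledger.cost_cents)
```

Keeping history and spend in one object means a cost display and a compaction command would read the same per-turn record, which is the property both findings say is missing today.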
+
+**Fix direction.** Record each turn's task + result into `session_context` (this is
+also required by CG-01's fix), so compaction and `/clear` operate on real history.
+Specified as P1-2.
+
+---
+
+## CG-05 - Two divergent chat implementations [P1]
+
+**Evidence.** `teaagent chat` uses `chat_repl.py::run_chat_repl`; the TUI chat mode
+uses `tui/__init__.py::TeaAgentTUI._run_agent_task`. Both independently implement:
+checkpoint create/restore, undo, effort levels, file-watcher, cost display, pinned
+files, with **different behavior**:
+
+| Behavior | REPL (`chat_repl.py`) | TUI (`tui/__init__.py`) |
+|---|---|---|
+| Undo scope | `git checkout -- .` (all files) | surgical, stashed files only |
+| Result handling | `result != 0` (broken, CG-01) | `result.status` / prints answer |
+| Session memory | `session_context` list | `ChatSession` + `SessionStore` |
+| Compaction | `ContextCompactor` on empty list | `compact` = stub (CG-07) |
+| Cost | `+= 10` placeholder | `_session_cost_cents` never set |
+
+**Why P1.** Two code paths for "the same product feature" guarantee the kind of
+behavior drift that produces CG-01/02/03 in the first place. The survey's
+gap-review F-ECO-010 asks for "CLI/TUI/dashboard parity for the same run state"; this
+is the root cause that makes parity impossible to maintain by hand.
+
+**Fix direction.** Extract a shared `ChatSessionController` that both surfaces drive,
+owning result handling, cost accounting, undo scope, and session memory. Surfaces keep
+only their I/O. Specified as P1-3 (enables P0-1/P0-2/P1-1/P1-2 to be fixed once).
+
+---
+
+## CG-06 - TUI "split-pane" clears the screen every prompt [P1]
+
+**Evidence.** `_should_use_split_pane` returns true for terminals ≥120×30
+(`tui/__init__.py:189-195`) and the main loop calls `_print_state_panel()` on **every**
+iteration (`:318-319`). `_print_state_panel` begins with
+`print('\033[2J\033[H', end='')` (`:205`), a full clear-screen + cursor-home. It then
+prints a *vertical* list labelled "[Chat Area]" and "[State Panel]". It is not an
+actual split pane, and the clear wipes the previous command's output (including agent
+answers and approval prompts) before each new prompt.
+
+**Why P1.** This auto-activates on large terminals, precisely what power users run.
+It destroys scrollback and prior output, which is the exact rendering-fragility
+reviewers already punish in competitors (Delta D-2: "scrolling back creates messy
+display"; OpenTUI praised as "better for extended sessions"). The feature as built is
+a net regression versus a plain scrolling REPL.
+
+**Fix direction.** Either implement a real fixed-region layout with prompt_toolkit's
+`Application`/full-screen layout (state panel in a side/bottom region, scrollable chat
+buffer that is never cleared), or disable the screen-clear and render the state panel
+as an opt-in `state` command. Specified as P1-4.
+
+---
+
+## CG-07 - TUI `compact` is an advertised no-op [P2]
+
+**Evidence.** Help lists `compact Compact session context to save tokens`
+(`tui/__init__.py:103`), but `_handle_compact` prints
+`'compact: session compaction not yet implemented in TUI'` (`:666-667`).
+
+**Why P2.** Advertising a token-saving command that does nothing erodes trust on a
+HIGH-priority theme (UX-F3), but it fails loudly (prints "not implemented") rather
+than silently, so it is less harmful than CG-04. Fix as part of CG-05's shared
+controller (the REPL compactor can be reused once session memory is shared).
+
+---
+
+## CG-08 - Two overlapping undo systems confuse recovery [P2]
+
+**Evidence.** TUI help documents both `undo Undo all changes (using checkpoint)`
+(`tui/__init__.py:108`) and, two lines earlier, references
+`teaagent agent undo for advanced` (`:108` continuation) plus a separate
+`undo [run_id] Restore workspace files from the last undo journal`
+(`:76`). So a user has: git-stash checkpoint undo, `UndoJournal` run-scoped undo, and
+the CLI `agent undo`. That is three mechanisms with two help entries that both say "undo".
+
+**Why P2.** Recovery is the moment trust is won or lost (UX-F4). Overlapping,
+differently-scoped undo verbs make it ambiguous *what* a given `undo` will revert
+(the whole concern behind CG-02). Consolidate onto the `UndoJournal` as the single
+operator-facing undo, and rename the git-stash one to `checkpoint restore` to remove
+the collision. Specified as P2-1.
+
+---
+
+## Cross-reference to baseline survey themes
+
+| Survey theme (2026-05-31) | Grounded by |
+|---|---|
+| UX-F1 Rate/cap surprises | CG-03 |
+| UX-F3 Context rot | CG-04, CG-06, CG-07 |
+| UX-F4 Silent/irreversible action | CG-01, CG-02, CG-08 |
+| UX-F5 Onboarding / first 5 min | CG-01 |
+| UX-F6 Cost unpredictability | CG-03 |
+| F-ECO-010 CLI/TUI parity | CG-05 |
+
+## Residual risk of this review
+
+- Findings are from static reading, not execution. CG-01 and CG-03 are
+  near-certain (the type mismatch and the absent increment are unambiguous in source);
+  CG-04 depends on whether `run_chat_agent` also persists session memory by another
+  path, though the *operator-visible* commands are still inert regardless. Each fix in
+  the plan ships with a regression test that makes the behavior executable-verifiable.
+- Line numbers reflect the working tree at 2026-06-01 on branch
+  `codex/plan-exec-2026-05-31`; re-anchor before editing.
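
As an illustration of the kind of regression check the residual-risk note refers to, here is a minimal sketch of the CG-01 type mismatch. `RunResult` is a simplified stand-in dataclass and `report` is a hypothetical handler; neither is copied from the teaagent source.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Stand-in for the real RunResult (hypothetical, simplified fields)."""
    status: str
    final_answer: str


def report(result: RunResult) -> str:
    """Hypothetical handler: branch on status, never on `result != 0`."""
    if result.status != "completed":
        return f"[TeaAgent] Task failed with status {result.status}"
    return result.final_answer


ok = RunResult(status="completed", final_answer="42 files scanned")

# The CG-01 bug: an object of this type never equals the integer 0,
# so a `result != 0` check is true even for a successful run.
always_true = ok != 0

# Intended behaviour: success yields the answer, failure reports the status.
success_text = report(ok)
failure_text = report(RunResult(status="error", final_answer=""))
print(always_true, success_text, failure_text)
```

A test built along these lines would fail against the interactive loop as described in CG-01 and pass once the loop branches on `result.status`.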
+
diff --git a/docs/analysis/daily-driver-findings-second-pass-2026-06-01.md b/docs/analysis/daily-driver-findings-second-pass-2026-06-01.md
new file mode 100644
index 0000000..4c06d8f
--- /dev/null
+++ b/docs/analysis/daily-driver-findings-second-pass-2026-06-01.md
@@ -0,0 +1,128 @@
+# Daily-Driver Findings - Second Pass (Reconsideration)
+# 2026-06-01
+
+**Why this doc exists.** A deliberate re-audit of the daily surfaces to (a) catch
+findings the first pass missed, (b) re-examine the severity of existing findings, and
+(c) state honestly what is *still* uncovered. Triggered by the question "reconsider all
+risks and gaps." Findings here continue the `CG-##` numbering from
+`daily-driver-code-grounded-ux-findings-2026-06-01.md`.
+
+---
+
+## New findings (missed by the first pass)
+
+### CG-09 - `/background` is misleading and silently switches your git branch [P1]
+
+**Evidence.** `chat_repl.py::suspend_to_background`:
+- `chat_repl.py:130-132`: `git checkout -b ` creates **and switches to**
+  `suspended-`, and **never switches back**. The user is left on a new branch
+  without a clear statement that their HEAD moved.
+- `chat_repl.py:150`: prints *"Background execution requires manual setup"*, i.e. **no
+  background execution actually happens**.
+- `chat_repl.py:640-648`: but the caller announces *"Interactive session converted to
+  background task"* and *"You can now safely exit the REPL."*
+
+So `/background` claims a capability it does not deliver and mutates git state as a side
+effect. This is a UX-F4 "confident misreport" plus a workflow-surprise data risk.
+
+**Divergence.** The TUI's `_handle_background` (`tui/__init__.py:717-720`) just prints
+*"use teaagent agent run --detach"* and does nothing, so the same verb means two
+entirely different things across surfaces (reinforces **CG-05**).
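
The branch-switch side effect described above comes from `git checkout -b`, which both creates a branch and moves HEAD onto it. Plain `git branch <name>` creates the branch and leaves the checkout alone. A minimal sketch of a snapshot helper built on that difference follows; `snapshot_branch`, the injected `run` parameter, and the branch name `suspended-demo` are illustrative assumptions, not teaagent code.

```python
import subprocess
from typing import Callable


def snapshot_branch(root: str, name: str,
                    run: Callable[..., object] = subprocess.run) -> list:
    """Create branch `name` at HEAD without moving the user's checkout.

    `git branch <name>` creates the branch but leaves HEAD where it is,
    unlike `git checkout -b <name>`, which also switches to it.
    """
    cmd = ["git", "branch", name]
    run(cmd, cwd=root, check=True)
    return cmd


# Demonstrate with a fake runner so the sketch does not need a real repo.
calls = []


def fake_run(cmd, **kwargs):
    calls.append((list(cmd), kwargs))


snapshot_branch("/tmp/repo", "suspended-demo", run=fake_run)
print(calls[0][0])
```

Injecting the runner keeps the helper testable without touching a working tree; with the default `subprocess.run` it would execute the real command in `root`.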
+
+**Why P1 (not P0).** It does not destroy file contents (unlike CG-02), but it silently
+relocates the user's branch and misrepresents the outcome. Until fixed, the help text
+and printed messages should state plainly: *"creates a branch snapshot; does not run
+anything in the background; you remain on the new branch."*
+
+**Fix direction.** Either (a) wire `/background` to the real detach path
+(`agent run --detach`) so the claim is true, or (b) demote it to an honest
+`/snapshot` that says exactly what it does and restores the original branch. Tie to
+**DQ-1** (background commitment vs non-goal).
+
+---
+
+### CG-10 - Background suspension bypasses the audit chain [P1]
+
+**Evidence.** `suspend_to_background` writes an `audit_trail` dict
+(`chat_repl.py:85-90`) **into the local suspension JSON file** (`:92-96`) and creates a
+git branch, but never calls the real `AuditLogger`/audit chain (contrast the governed
+run path in `tui/__init__.py:776-782`, which uses `store.audit_logger()` and an
+`UndoJournal`). The `audit_trail` key is self-described JSON, not a tamper-evident event.
+
+**Why P1.** Audit-everything is teaagent's stated core differentiator (UX-F7, security
+whitepaper). A state-changing operation (branch creation, file writes) that is invisible
+to the audit chain is a governance hole, exactly the gap enterprises cite (only 21%
+have runtime visibility; 33% have no audit trail, per the May-31 survey §4).
+
+**Fix direction.** Route suspension through the audit logger (emit a
+`session_suspended` event with branch + file refs) so background transitions are
+auditable. Couple with CG-09's fix.
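
To show what separates a tamper-evident event from a self-described JSON key, here is a minimal hash-chain sketch. It is a generic illustration, not teaagent's `AuditLogger`: `append_event`, `verify`, the payload fields, and the file name `suspension.json` are all assumptions made for the example.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """Hash a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_event(chain: list, event_type: str, payload: dict) -> dict:
    """Append an event whose hash also covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"type": event_type, "payload": payload, "prev": prev_hash}
    entry = {**body, "hash": _digest(body)}
    chain.append(entry)
    return entry


def verify(chain: list) -> bool:
    """Recompute every hash and check each link to its predecessor."""
    prev_hash = "0" * 64
    for entry in chain:
        body = {"type": entry["type"], "payload": entry["payload"],
                "prev": entry["prev"]}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append_event(chain, "session_started", {"root": "."})
append_event(chain, "session_suspended",
             {"branch": "suspended-demo", "files": ["suspension.json"]})
intact = verify(chain)

# Editing a recorded payload after the fact breaks verification.
chain[1]["payload"]["branch"] = "main"
tampered = verify(chain)
print(intact, tampered)
```

An `audit_trail` dict written into a local file has no such link to earlier events, so it can be edited or deleted without leaving a trace; that is the gap this finding describes.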
+
+---
+
+## Severity re-examination of first-pass findings
+
+| ID | First-pass | Re-examined | Rationale |
+|----|-----------|-------------|-----------|
+| CG-01 | P0 | **P0 (confirmed)** | Type mismatch is unambiguous; primary surface broken |
+| CG-02 | P0 | **P0 (confirmed)** | Only irreversible/data-loss item; release blocker |
+| CG-03 | P1 | P1 (confirmed) | Misleads but does not corrupt; competitive table-stakes |
+| CG-04 | P1 | **P1→P2 (consider)** | Operator-visible compaction is theater, but `run_chat_agent` may manage real context internally, so *user harm is "false reassurance," not lost work*. Could drop to P2 if internal context is sound. **Needs the P1-2 test to decide.** |
+| CG-05 | P1 | **P1→escalate** | Root cause of CG-01/02/03/09; the divergence keeps producing new bugs (CG-09 is fresh proof). Treat as the *first* structural fix, not a late one |
+| CG-06 | P1 | P1 (confirmed) | Auto-on for power users; named switching-trigger |
+| CG-07 | P2 | P2 (confirmed) | Fails loudly ("not implemented") |
+| CG-08 | P2 | P2 (confirmed) | Ambiguity, not loss |
+
+**Net change:** +2 findings (CG-09, CG-10, both P1). CG-05 escalated in *sequencing*
+priority (do it early). CG-04 flagged as possibly-overstated pending its test.
+
+---
+
+## Completeness audit: are all risks & gaps now covered?
+
+### Survey themes (UX-F#)
+
+| Theme | Status after second pass |
+|-------|--------------------------|
+| UX-F1 caps | covered (CG-03 / P1-1) |
+| UX-F2 autonomous changes w/o permission | **covered by existing governance**: approval flow gates destructive tools; no finding contradicts it. *Residual:* verified by reading, not by an adversarial test (see residual R-2) |
+| UX-F3 context rot/rendering | covered (CG-04, CG-06, CG-07) |
+| UX-F4 silent/irreversible | covered (CG-01, CG-02, CG-08, **CG-09, CG-10**) |
+| UX-F5 onboarding | covered (CG-01) |
+| UX-F6 cost | covered (CG-03) |
+| UX-F7 trust under autonomy | covered (governance + **CG-10** is the new gap) |
+| UX-F8 IDE lock-in | structural advantage; not a defect (IDE spec covers parity) |
+
+### Ecosystem gaps (F-ECO-###)
+
+All 13 (002–014) now have a spec or an explicit decision/non-goal (see INDEX). F-ECO-003
+(background) is **directly implicated by CG-09/CG-10**: the background story is not just
+"thin," it is *misleading* in the REPL. This raises F-ECO-003's priority.
+
+---
+
+## Residual risks (honestly stated: what this review still cannot guarantee)
+
+- **R-1 (static analysis).** All `CG-##` findings are from reading, not execution. The
+  hardening plan attaches a test to each; until those run, severities are
+  evidence-based estimates. CG-04 is the most likely to move.
+- **R-2 (no adversarial test of UX-F2).** I confirmed the approval gate by reading, not
+  by trying to make the agent act without approval. A negative test
+  (`test_destructive_tool_requires_approval_all_modes`) should be added to *prove* it.
+- **R-3 (surfaces not exhaustively read).** I read TUI, chat REPL, run_summary, and the
+  modules behind the 11 specs. I did **not** fully read: `swarm.py`, `consensus.py`,
+  `tournament/`, `gateway/`, `control_plane_*`, `federated_sync.py`. Findings do not
+  cover those; they are out of the daily-driver scope but are *not* certified clean.
+- **R-4 (spec risks unassessed until now).** The 11 design specs could each introduce
+  new risk when built; this is assessed in `daily-driver-execution-readiness-2026-06-01.md`
+  (the `SR-#` register).
+- **R-5 (line drift).** All `file:line` refs are against branch
+  `codex/plan-exec-2026-05-31` at 2026-06-01; re-anchor before editing.
+
+## Updates to other docs (apply these)
+
+- Add **CG-09, CG-10** to the recommendation log §1 and the traceability matrix.
+- Add **PR-7 (branch-switch surprise)** and **PR-8 (unaudited suspension)** to the risk
+  register.
+- Raise **F-ECO-003** priority in the journey-maps P-OPS section.
+
diff --git a/docs/analysis/daily-driver-open-decisions-2026-06-01.md b/docs/analysis/daily-driver-open-decisions-2026-06-01.md
new file mode 100644
index 0000000..7b0112f
--- /dev/null
+++ b/docs/analysis/daily-driver-open-decisions-2026-06-01.md
@@ -0,0 +1,36 @@
+# Daily-Driver Review - Open Decisions Register
+# 2026-06-01
+
+Every point in the 2026-06-01 review that is genuinely the **maintainer's call**, not
+something I should decide by default. Each has options, a recommendation, and what it
+blocks. Resolve these before the corresponding work starts.
+
+| ID | Decision | Options | Recommendation | Blocks |
+|----|----------|---------|----------------|--------|
+| **DQ-1** | Are P-OPS background/cloud journeys a near-term commitment or a documented non-goal? | (a) commit + build lifecycle (b) declare non-goal w/ attach recipes (F-ECO-004) | (a) if enterprise is the target; else (b) and say so loudly | journey-maps P-OPS rows; F-ECO-003/004 |
+| **DQ-2** | Should P-ML parallel-experiment comparison emit a file artifact (reproducible) or stay TUI-only? | (a) write comparison evidence file (b) TUI-only | (a): reproducibility is the ML persona's core need | evidence-bundle scope |
+| **DQ-3** | P1-4 TUI: full prompt_toolkit fixed-region app, or just drop the auto-clear? | (a) real layout (medium effort) (b) drop auto-clear + opt-in `state` (small) | (b) now to stop the regression; (a) as a follow-up | CG-06 fix size; cockpit render |
+| **DQ-4** | Should `prompt` mode + background be (a) auto-pre-granted, (b) refused with a message, or (c) silently allowed (current)? | a / b / c | (b) refuse with a clear message until pre-grant/JIT is set | PMR background row; PR-1 interaction |
+| **DQ-5** | Cost source for display: server-reported only, local tiktoken estimate, or both labeled? | server / local / both | both, **labeled** (per Hermes #504 debate) | P1-1; cockpit budget; evidence economics |
+| **DQ-6** | Consolidate undo onto `UndoJournal` and rename git-stash to `checkpoint restore`: acceptable breaking change to the `undo` verb? | (a) yes, rename (b) keep both, add docs | (a): overlapping `undo` verbs are a recovery hazard (CG-08) | P2-1 |
+| **DQ-7** | Is this volume of docs the desired working mode, or should findings go straight to GitHub issues / the existing `docs/backlog-priority.md`? | (a) keep doc package (b) issues (c) fold into backlog-priority | maintainer preference; see note below | future review cadence |
+
+## Note on DQ-7 (and on "write as many docs as possible")
+
+I produced a **consolidation package** (index, recommendation log, this register,
+assumptions/non-goals, traceability, backlog) rather than many near-duplicate files,
+because past a point more files *reduce* signal: the May-31 corpus already covers the
+broad survey, and duplicating it dilutes the code-grounded findings that are the real
+new value. If the intent is issue-tracker tickets or entries in
+`docs/backlog-priority.md` instead of standalone docs, say so and I'll convert; that may
+be the better home for actionable items.
+
+## How to record decisions
+
+When you resolve one, append a line here: `DQ-#: chose