feat: task_message tiered storage (D1 metadata + R2 content + KV cache)#84

Open
GenerQAQ wants to merge 13 commits into main from feat/task-message-tiered-storage
@GenerQAQ GenerQAQ commented May 19, 2026

Contributor

Why we need this PR?

The D1 database is at 8GB with only 78 users. Root cause: the task_message table has 6.3M rows storing full tool call content/input/output in D1, and two users alone account for nearly all of those rows (4.1M + 2.2M). This PR moves the heavy content to R2 with KV caching, keeping only lightweight metadata in D1.

Expected D1 reduction: 8GB → <1GB

What changed

  • New TaskMessageStore class (src/web/src/lib/task-message-store.ts) — tiered read/write with KV → R2 fallback
  • Daemon write endpoint — dual-writes D1 metadata + R2/KV full content
  • User read endpoints — reads from KV/R2 store instead of D1
  • Conversation delete — cleans up R2/KV when tasks are deleted
  • New R2 bucket alook-task-messages (already created)
  • Batch migration script (scripts/migrate-task-messages-to-r2.ts) with --dry-run and --offset support
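
For reviewers who want a feel for how the `--dry-run` and `--offset` flags could drive an incremental, resumable run, here is a minimal sketch. It is not the contents of scripts/migrate-task-messages-to-r2.ts: `parseFlags`, `migrate`, and the `write` callback are hypothetical names, and the real script may batch and log differently.

```typescript
// Hypothetical sketch, not the actual migration script.
interface MigrateFlags {
  dryRun: boolean;
  offset: number;
}

// Parse `--dry-run` and `--offset N` (or `--offset=N`) from an argv-style array.
function parseFlags(argv: string[]): MigrateFlags {
  const flags: MigrateFlags = { dryRun: false, offset: 0 };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i] ?? "";
    if (arg === "--dry-run") {
      flags.dryRun = true;
    } else if (arg === "--offset" || arg.startsWith("--offset=")) {
      const raw = arg === "--offset" ? argv[++i] : arg.slice("--offset=".length);
      const n = Number(raw);
      if (!Number.isInteger(n) || n < 0) {
        throw new Error(`--offset expects a non-negative integer, got: ${raw}`);
      }
      flags.offset = n;
    }
  }
  return flags;
}

// Walk task IDs starting at `offset`. In dry-run mode nothing is written;
// `write` stands in for the real R2 upload (an assumption for illustration).
function migrate(
  taskIds: string[],
  flags: MigrateFlags,
  write: (taskId: string) => void,
): { processed: number; written: number } {
  const pending = taskIds.slice(flags.offset);
  let written = 0;
  for (const taskId of pending) {
    if (!flags.dryRun) {
      write(taskId);
      written++;
    }
  }
  return { processed: pending.length, written };
}
```

Keeping the dry-run check next to the single write call means a dry run walks exactly the same task list as a real run, so the reported counts can be compared before anything is uploaded.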

Storage Architecture

D1 (metadata): id, task_id, seq, type, tool, call_id, created_at
R2 (content):  task-messages/{taskId}.json → full TaskMessage[]
KV (cache):    tm:{taskId} → same JSON, TTL 7 days
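
The layout above implies a read path that tries KV first and falls back to R2, plus a write path that treats R2 as the source of truth. A rough sketch under stated assumptions: `KVLike` and `R2Like` are simplified stand-ins rather than the real Workers binding types, the `TaskMessage` shape is trimmed for illustration, and `readMessages` / `writeMessages` are hypothetical names, not the methods of the actual TaskMessageStore class.

```typescript
// Simplified stand-ins for the KV and R2 bindings (assumed shapes).
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}
interface R2Like {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Trimmed-down message shape, for illustration only.
interface TaskMessage {
  seq: number;
  type: string;
  content: string;
}

const KV_TTL_SECONDS = 7 * 24 * 60 * 60; // 7 days, per the layout above

const kvKey = (taskId: string): string => `tm:${taskId}`;
const r2Key = (taskId: string): string => `task-messages/${taskId}.json`;

// Read path: KV first, then R2; a successful R2 read repopulates KV.
async function readMessages(kv: KVLike, r2: R2Like, taskId: string): Promise<TaskMessage[]> {
  let cached: string | null = null;
  try {
    cached = await kv.get(kvKey(taskId));
  } catch {
    // Treat a KV failure as a cache miss so reads degrade gracefully.
  }
  if (cached !== null) return JSON.parse(cached) as TaskMessage[];

  const stored = await r2.get(r2Key(taskId));
  if (stored === null) return [];
  try {
    await kv.put(kvKey(taskId), stored, { expirationTtl: KV_TTL_SECONDS });
  } catch {
    // Cache repopulation is best-effort.
  }
  return JSON.parse(stored) as TaskMessage[];
}

// Write path: R2 holds the durable copy; the KV refresh is best-effort.
async function writeMessages(
  kv: KVLike,
  r2: R2Like,
  taskId: string,
  messages: TaskMessage[],
): Promise<void> {
  const body = JSON.stringify(messages);
  await r2.put(r2Key(taskId), body);
  try {
    await kv.put(kvKey(taskId), body, { expirationTtl: KV_TTL_SECONDS });
  } catch {
    // A stale or missing cache entry is recovered by the next read.
  }
}
```

Writing to R2 before KV means a failed cache write can only produce a cache miss, never a cache entry with no durable copy behind it.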

Deployment Sequence

  1. ✅ R2 bucket alook-task-messages created
  2. Merge this PR → enables dual-write (D1 full row + R2/KV content)
  3. Run the migration script with --dry-run first (npx tsx scripts/migrate-task-messages-to-r2.ts --dry-run), then again without --dry-run
  4. Verify R2 data integrity
  5. Separately: add migration 0030 to drop D1 content/input/output columns (in a follow-up PR after confirming R2 data is complete)
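
Step 4 does not say how integrity is verified. One possible check, sketched here as an assumption rather than the planned procedure, is to confirm that every seq recorded in D1 for a task also appears in that task's R2 array. `SeqIndex`, `Mismatch`, and `findMissing` are hypothetical names; gathering the seq lists from D1 and R2 is left out.

```typescript
// taskId -> seq numbers, built once from D1 metadata and once from R2 content.
type SeqIndex = Map<string, number[]>;

interface Mismatch {
  taskId: string;
  missingSeqs: number[];
}

// Report, per task, the seq values present in D1 but absent from R2.
// An empty result means every D1 row has a counterpart in R2.
function findMissing(d1: SeqIndex, r2: SeqIndex): Mismatch[] {
  const mismatches: Mismatch[] = [];
  d1.forEach((d1Seqs, taskId) => {
    const inR2 = new Set<number>(r2.get(taskId) ?? []);
    const missingSeqs = d1Seqs.filter((seq) => !inR2.has(seq));
    if (missingSeqs.length > 0) {
      mismatches.push({ taskId, missingSeqs });
    }
  });
  return mismatches;
}
```

A check like this only covers presence, not content equality; comparing payloads would need the D1 content columns, which is one more reason to run it before migration 0030 drops them.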

Migration 0030 (NOT in this PR, for follow-up)

CREATE TABLE task_message_new (
  id TEXT PRIMARY KEY,
  task_id TEXT NOT NULL REFERENCES agent_task_queue(id) ON DELETE CASCADE,
  seq INTEGER NOT NULL,
  type TEXT NOT NULL DEFAULT '',
  tool TEXT NOT NULL DEFAULT '',
  call_id TEXT NOT NULL DEFAULT '',
  created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
INSERT INTO task_message_new (id, task_id, seq, type, tool, call_id, created_at)
  SELECT id, task_id, seq, type, tool, call_id, created_at FROM task_message;
DROP TABLE task_message;
ALTER TABLE task_message_new RENAME TO task_message;
CREATE INDEX idx_task_message_task_seq ON task_message(task_id, seq);
CREATE INDEX idx_task_message_task_created ON task_message(task_id, created_at);

Checklist

  • Tests added/updated as needed (51 tests covering store + all affected routes)
  • All CI checks pass (lint ✅, no new tsc errors)
  • PR targets the correct branch

Impact Areas

  • Shared library (@alook/shared)
  • Web app (@alook/web)
  • CLI (@alook/cli)
  • Email Worker (@alook/email-worker)
  • WebSocket DO (@alook/ws-do)
  • CI/CD
  • Other: R2 bucket, migration script

GenerQAQ added 13 commits May 19, 2026 11:35
D1 stores metadata only, R2 stores full message content per task,
KV provides read-through cache. Targets reducing D1 from 8GB to <1GB.
24 tests covering: append, list, delete, filtering (since/excludeTypes),
graceful KV degradation, data integrity across write/read cycles.
Dual-write: D1 metadata + R2/KV full content. Read from store.
chat-init and task messages now read from R2/KV store.
This migration should only be applied AFTER the batch R2 migration
has completed and R2 data integrity is verified.
Supports --dry-run and --offset for safe incremental migration.
Run before applying migration 0030.
All tests now mock TaskMessageStore class instead of D1 query functions.
Migration will be applied separately after batch R2 migration completes,
since bump auto-executes migrations on deploy.
- taskMessageToResponse now handles both camelCase (Drizzle) and
  snake_case (TaskMessage interface from R2/KV store)
- Daemon GET endpoint verifies task belongs to workspace before
  returning messages, matching POST behavior
@GenerQAQ GenerQAQ force-pushed the feat/task-message-tiered-storage branch from d521e48 to a2013f9 Compare May 19, 2026 04:31
@GenerQAQ GenerQAQ requested a review from a team as a code owner May 22, 2026 12:00

@gusye1234 gusye1234 left a comment

Contributor

Review Summary

Architecture is sound and well-executed — approve with minor suggestions.

The tiered storage approach (D1 metadata + R2 content + KV cache) correctly addresses the scaling problem. 51 tests is solid coverage.

Actionable Items (non-blocking):

  1. Fix: Hardcoded D1_DATABASE_ID in migrate-task-messages-to-r2.ts — Replace with an env var. Hardcoded UUIDs in source are a maintainability and security concern.

  2. Fix: Unused cacheKeys.taskMessages in cache.ts — The store uses its own inline "tm:${taskId}" prefix instead. Either wire it into the store or remove the addition to avoid confusion.

  3. Consider: id: "" in R2 content — Messages stored in R2 have empty ID because D1 insert IDs are not captured from createTaskMessage results. If anything needs to correlate R2 messages back to D1 rows by ID, this will break. Document the assumption or propagate actual IDs.

  4. Race condition in appendMessages — The readAll + append + put sequence is not atomic. Acceptable if single-writer-per-task is a guaranteed invariant. Worth a comment documenting that assumption.

  5. Store instantiation repeated 5 times across route files. Consider a getTaskMessageStore() factory helper.

None of these block merge.
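
To illustrate item 4: if single-writer-per-task ever stops holding, one option is to serialize the read + append + put cycle per task. The sketch below is a hypothetical in-process lock (`withTaskLock`, `BlobStore`, and this `appendMessages` are illustrative names, not the PR's code). It only orders callers that share one isolate or process; writers in different isolates would need a different coordination mechanism, such as routing writes for a task through a single Durable Object.

```typescript
// Simplified stand-in for the R2-backed store (assumed shape).
interface BlobStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Tail of the pending work chain for each task ID.
const tails = new Map<string, Promise<void>>();

// Run `fn` only after every earlier call for the same task has settled.
function withTaskLock<T>(taskId: string, fn: () => Promise<T>): Promise<T> {
  const prev = tails.get(taskId) ?? Promise.resolve();
  const run = prev.then(fn);
  // The stored tail never rejects, so one failure does not block later calls.
  const tail = run.then(
    () => undefined,
    () => undefined,
  );
  tails.set(taskId, tail);
  // Drop the entry once the chain is idle to avoid unbounded growth.
  void tail.then(() => {
    if (tails.get(taskId) === tail) tails.delete(taskId);
  });
  return run;
}

// Read-modify-write under the per-task lock, so concurrent appends in the
// same process cannot overwrite each other's messages.
async function appendMessages(
  store: BlobStore,
  taskId: string,
  incoming: string[],
): Promise<void> {
  await withTaskLock(taskId, async () => {
    const raw = await store.get(taskId);
    const existing: string[] = raw === null ? [] : JSON.parse(raw);
    await store.put(taskId, JSON.stringify([...existing, ...incoming]));
  });
}
```

Without the lock, two overlapping appends can both read the same snapshot and the later put silently drops the earlier writer's messages, which is the failure mode the review comment describes.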

2 participants