diff --git a/packages/software-factory/README.md b/packages/software-factory/README.md index 9e828f4f7f9..9f1def93f5d 100644 --- a/packages/software-factory/README.md +++ b/packages/software-factory/README.md @@ -1,127 +1,94 @@ # Software Factory -Local card-development harness for fast Boxel iteration. +The software factory is an automated card-development system that takes a brief (a description of what card to build) and produces a working Boxel card — complete with card definition, sample instances, catalog spec, and QUnit tests — in a target realm. -This package gives you a cached local realm fixture, an isolated realm server -with harness-managed ports, and a Playwright loop that exercises cards in the -real browser app shell. +## How It Works -## Prerequisites +The factory flow has four phases: + +1. **Intake** — Fetch a brief card from a source realm, normalize it into a structured representation +2. **Bootstrap** — Create a target realm (if needed), populate it with a Project card, Knowledge Articles, and starter Tickets +3. **Implementation** — An LLM agent picks up the active ticket and uses tool calls to write card definitions (`.gts`), sample instances (`.json`), catalog specs (`Spec/`), and QUnit test files (`.test.gts`) into the target realm +4. **Verification** — The orchestrator runs QUnit tests via Playwright in a real browser, collects structured results into a TestRun card, and feeds failures back to the agent for iteration + +The agent iterates (implement → test → fix) until tests pass or max iterations are reached. The orchestrator (the "ralph loop") controls iteration count, test execution, and ticket selection deterministically — the LLM handles only the implementation work. + +### Realm Roles + +- **Source realm** (`packages/software-factory/realm/`) — publishes shared modules, card type definitions (Project, Ticket, KnowledgeArticle, TestRun), briefs, and templates. Never written to by the factory. 
+- **Target realm** (user-specified) — receives all generated artifacts: card definitions, instances, specs, test files, and TestRun results. +- **Fixture realm** (`test-fixtures/`) — disposable test input for development-time verification of the factory itself. -This package is TypeScript-only. New scripts, tests, and package utilities -should be written in `.ts`, not `.mjs`. +### Target Realm Artifact Structure -Editor/type support for `.gts` files is provided through `glint` via this -package's `tsconfig.json`, matching the realm-package pattern used elsewhere in -the repo. Package linting currently runs `glint`, `eslint`, and `prettier`. +| Path | What it is | +| --------------------- | ------------------------------------------------------------------- | +| `Projects/` | Project card with objective, scope, success criteria | +| `Tickets/` | Ticket cards tracking implementation work | +| `Knowledge Articles/` | Context articles derived from the brief | +| `*.gts` | Card definition files | +| `*.test.gts` | Co-located QUnit test files | +| `CardName/` | Sample card instances with realistic data | +| `Spec/` | Catalog Spec cards linking to card definitions and sample instances | +| `Test Runs/` | TestRun cards with structured pass/fail results | + +## Prerequisites - Docker running -- Host app assets available at `http://localhost:4200/` - - use `cd packages/host && pnpm serve:dist` -- Boxel icons server available at `http://localhost:4206/` - - use `cd packages/boxel-icons && pnpm serve` - - in a worktree where boxel-icons hasn't been built, symlink the dist from the main checkout: - `ln -s /path/to/boxel/packages/boxel-icons/dist packages/boxel-icons/dist` - - required before `cache:prepare` — the harness indexes cards that reference icon modules - -The harness starts its own seeded test Postgres, Synapse, prerender server, and -isolated realm server. By default it serves the test realm and base realm from -the same fixed realm-server origin. 
The skills realm can be enabled when needed -with `SOFTWARE_FACTORY_INCLUDE_SKILLS=1`. - -For the software-factory Playwright flow, the isolated realm stack is -self-contained and writes its actual runtime URLs and ports to harness metadata. -The fixture realms use the placeholder origin `https://sf.boxel.test/`, which -the harness rewrites to the live source-realm URL at startup. The Playwright -flow does not require a separate external realm server on `http://localhost:4201/`. - -## Commands - -- `pnpm cache:prepare` - - Builds or reuses the cached template database for `test-fixtures/darkfactory-adopter/` -- `pnpm serve:support` - - Starts shared support services and prepares a reusable runtime context in the background -- `pnpm serve:realm` - - Starts the isolated realm server for `/test/` on a dynamically assigned realm-server URL -- `pnpm smoke:realm` - - Boots the isolated realm server, fetches `project-demo` as card JSON, and exits -- `pnpm factory:go -- --brief-url --target-realm-url ` - - Fetches and normalizes a brief, bootstraps the target realm, and prints a machine-readable run summary -- `pnpm test` - - Runs package tests from `tests/*.test.ts` and `tests/*.spec.ts` -- `pnpm test:node` - - Runs only Node-side `tests/*.test.ts` -- `pnpm test:playwright` - - Runs the browser tests against the software-factory Playwright harness -- `pnpm test:realm -- --realm-path ./realms/` - - Runs realm-hosted Playwright specs via the typed realm test runner -- `pnpm boxel:session` - - Prints browser session/auth payloads for the active Boxel profile -- `pnpm boxel:search -- --realm ...` - - Runs a typed `_search` query against a realm -- `pnpm boxel:pick-ticket -- --realm ...` - - Finds candidate tracker tickets in a target realm - -All commands accept an optional realm directory argument: +- `mise run dev-all` (starts realm server, host app, icons server, Postgres, Synapse) +- Matrix credentials (username/password) for realm creation and auth +- An [OpenRouter API 
key](https://openrouter.ai/keys) for the LLM agent (when running the full factory) -```bash -pnpm cache:prepare ./my-realm -pnpm serve:realm ./my-realm -pnpm smoke:realm ./my-realm Person/example-card -``` +## Running the Factory -## `factory:go` +Make sure the prerequisites above are met, and that you have a brief card published in the software-factory realm (e.g., `http://localhost:4201/software-factory/Wiki/sticky-note`). -Usage: +Set up credentials first (these persist in your shell session): ```bash -pnpm factory:go -- \ - --brief-url http://localhost:4201/software-factory/Wiki/sticky-note \ - --target-realm-url http://localhost:4201/hassan/personal/ \ - [--realm-server-url http://localhost:4201/] \ - [--mode implement] +export MATRIX_URL=http://localhost:8008/ +export MATRIX_USERNAME=your-username +read -s 'MATRIX_PASSWORD?Matrix password: ' && export MATRIX_PASSWORD +export OPENROUTER_API_KEY=sk-or-v1-your-key-here ``` -Parameters: - -- `--brief-url` - - Required. Absolute URL for the source brief card the factory should use as input. - - The command fetches card source JSON from this URL and includes normalized brief metadata in the summary. -- `--target-realm-url` - - Required. Absolute URL for the target realm the factory should bootstrap and later populate. -- `--realm-server-url` - - Optional. Explicit realm server URL for target-realm bootstrap when it cannot be inferred unambiguously from the target realm URL. -- `--mode` - - Optional. One of `bootstrap`, `implement`, or `resume`. Defaults to `implement`. -- `--help` - - Optional. Prints the command usage and exits. - -Auth: - -- `MATRIX_USERNAME` is required and determines the target realm owner. -- If the brief is in a public realm, you do not need any auth setup. 
-- If the brief is in a private realm, `factory:go` can authenticate using: - - the active Boxel profile in `~/.boxel-cli/profiles.json` - - `MATRIX_URL`, `MATRIX_USERNAME`, `MATRIX_PASSWORD`, and `REALM_SERVER_URL` -- When the target realm does not exist yet, `factory:go` creates it with `POST /_create-realm`. -- By default the target realm server URL is inferred from `--target-realm-url`, but `--realm-server-url` can override that when the realm server is mounted under a subdirectory. -- The realm-server `/_create-realm` contract is the readiness boundary for bootstrap. - -Private brief with explicit Matrix username/password env: +Then run the factory: ```bash -export MATRIX_URL=http://localhost:8008/ -export MATRIX_USERNAME=factory -read -s MATRIX_PASSWORD'?Matrix password: ' -export MATRIX_PASSWORD -export REALM_SERVER_URL=http://localhost:4201/ +cd packages/software-factory pnpm factory:go -- \ --brief-url http://localhost:4201/software-factory/Wiki/sticky-note \ - --target-realm-url http://localhost:4201/factory/personal/ \ - --realm-server-url http://localhost:4201/ + --target-realm-url http://localhost:4201/your-username/my-test-realm/ \ + --debug ``` +The `--debug` flag shows LLM prompts, tool calls and their results, and `console.log` output from QUnit tests as they run. + +### What to expect on the command line + +``` +[factory:go] mode=implement brief=http://localhost:4201/software-factory/Wiki/sticky-note +[factory:go] Starting bootstrap + implement flow... +[test-run-execution] Serving QUnit page at http://127.0.0.1: for realm ... 
+[test-run-execution] QUnit completed in ms: test(s) +[factory-implement] Updated ticket status to done +[factory:go] Implement complete: outcome=tests_passed iterations= toolCalls= +``` + +### What to expect in the Boxel host app (target realm) + +| Folder / File | What it is | +| -------------------------- | ------------------------------------------------------------------------- | +| `Projects/` | A Project card with the brief's objective and success criteria | +| `Tickets/` | Ticket cards — the active ticket should show status `done` | +| `Knowledge Articles/` | Context articles derived from the brief | +| `*.gts` | Card definition file(s) for the implemented card | +| `*.test.gts` | Co-located QUnit test file(s) | +| `StickyNote/` (or similar) | Sample card instance(s) with realistic data | +| `Spec/` | Catalog Spec card(s) linking to the card definition and sample instances | +| `Test Runs/` | TestRun card(s) with structured pass/fail results grouped by QUnit module | + ## Layout - `test-fixtures/darkfactory-adopter/` diff --git a/packages/software-factory/docs/phase-2-plan.md b/packages/software-factory/docs/phase-2-plan.md index 1aa1e95e530..21058d5c6c1 100644 --- a/packages/software-factory/docs/phase-2-plan.md +++ b/packages/software-factory/docs/phase-2-plan.md @@ -4,7 +4,7 @@ Phase 1 (`one-shot-factory-go-plan.md`) implements a fixed pipeline: intake → bootstrap → implement → test → iterate. This works for the first pass but hard-codes the loop structure and the relationship between implementation and testing. -Phase 2 moves to an **issue-driven loop** aligned with the target architecture in `architecture.md`. The orchestrator becomes a thin scheduler that picks the next issue and delegates everything — including bootstrap, implementation, test creation, and test execution — to the agent via the issue tracking system. +Phase 2 moves to an **issue-driven loop** aligned with the target architecture in `architecture.md`. 
The orchestrator becomes a thin scheduler that picks the next issue and delegates implementation work to the agent. The orchestrator owns validation (parse, lint, evaluate, instantiate, run tests) after every agent turn, feeding failures back so the agent can self-correct. ## Core Idea @@ -13,7 +13,8 @@ The factory loop iterates over **issues in the project**, one at a time. Each is 1. Select the next unblocked issue (based on ordering / dependency rules) 2. Hand it to the agent 3. Wait for the agent to exit -4. Read updated issue state and repeat +4. Run validation (parse, lint, evaluate, instantiate, run tests) — feed failures back as context +5. Read updated issue state and repeat (inner loop) or advance to next issue (outer loop) The agent always exits the same way — the orchestrator reads the issue's updated status/tags to decide what happened. If the agent tagged the issue as blocked (e.g., needs human clarification), the orchestrator skips it and moves on. If the issue is marked done, the orchestrator advances. This keeps the agent's exit path uniform — it doesn't need a separate "blocked" signal in its return type, it just updates the issue and exits. @@ -23,7 +24,7 @@ This makes the loop generic. It doesn't need to know whether an issue is "implem Issues need properties that let the orchestrator determine execution order. Possible fields (may use a combination): -- **priority** — numeric or enum, lower = execute first +- **priority** — enum (`high`, `medium`, `low`), high = execute first - **predecessors / blockedBy** — explicit dependency edges; an issue cannot start until its blockers are done - **order** — explicit sequence number for tie-breaking @@ -31,33 +32,56 @@ The selection algorithm: 1. Filter to issues with status `ready` or `in_progress` 2. Exclude issues whose `blockedBy` list contains any non-completed issue -3. Sort by priority (ascending), then by order (ascending) +3. 
Sort by priority (high first, then medium, then low), then by order (ascending) 4. Pick the first one Resume semantics: if an issue is already `in_progress`, it takes priority over `ready` issues (the factory was interrupted and should continue where it left off). -## Tests as Issues, Not Loop Phases +## Validation Phase After Every Iteration -In phase 1, test execution is a hard-coded step after the agent signals done. In phase 2, tests are just another issue type. +The loop has two levels: -During task breakdown, the agent creates issues like: +- **Outer loop** — iterates over all unblocked, unfinished issues (picks the next one when the current one is done or blocked) +- **Inner loop** — iterates on a single issue until the agent marks it done, blocked, or max iterations are reached -- "Implement StickyNote card definition" (type: implement) -- "Create sample StickyNote instances" (type: implement) -- "Write Playwright tests for StickyNote" (type: test) -- "Run Playwright tests for StickyNote" (type: test-execution) +Every **inner-loop iteration** (agent turn) is followed by a **validation phase** owned by the orchestrator. An issue may require multiple iterations before it's done — validation runs after each one. This is similar to how Phase 1 runs tests after the agent signals done, but expanded to a full automated evaluation pipeline. The agent does not need to create separate "run tests" issues — validation is baked into the inner loop. + +### Validation Steps + +After each agent turn in the inner loop, the orchestrator runs these checks deterministically (as described in `architecture.md`): + +1. **Parse** — Verify that all modified `.gts` and `.json` files are syntactically valid +2. **Lint** — Run lint checks on modified files +3. **Module evaluation** — Ensure card modules load and evaluate without errors (import resolution, no runtime crashes) +4. **Card instantiation** — Verify that sample card instances can be instantiated from their definitions +5. 
**Run existing tests** — Execute all QUnit `.test.gts` files in the target realm via the QUnit test page + +### Handling Failures -The test-execution issue has `blockedBy` pointing to the implementation and test-writing issues. When the orchestrator picks it up, the agent (or orchestrator) runs the test suite. If tests fail, the orchestrator can: +Validation failures are fed back to the agent as context in the **next inner-loop iteration**. The orchestrator does not create fix issues for validation failures — it iterates with the failure details so the agent can self-correct. This mirrors Phase 1's approach (feed test results back, iterate) but with a broader validation pipeline. -- Reopen the implementation issue with failure context -- Create a new fix issue that blocks a re-run of the test-execution issue -- Let the agent decide how to handle the failure (more flexible) +The inner loop continues until: -This removes the assumption that every implementation issue has a test phase. Some issues (e.g., "create knowledge article") may not need tests. Others (e.g., "run full regression") are pure test issues. +- The agent marks the issue as done (all validation passes) +- The agent marks the issue as blocked (needs human input) +- Max iterations are reached + +The agent always has the option to create new issues via tool calls if it determines that a failure requires separate work (e.g., "this card definition depends on another card that doesn't exist yet — creating a new issue for it"). But the orchestrator does not force this — the agent decides. 
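The validation steps and the failure feedback described above can be sketched roughly as follows. This is a minimal illustration, not the factory's implementation: `StepResult`, `ValidationStep`, `validate`, and `failureContext` are hypothetical names, and the sketch assumes every step runs even when an earlier one fails, which the plan does not specify.

```typescript
// Hypothetical types and helpers for illustration only; none of these names
// are taken from the software-factory codebase.
type StepName = 'parse' | 'lint' | 'evaluate' | 'instantiate' | 'test';

interface StepResult {
  step: StepName;
  passed: boolean;
  failures: string[]; // human-readable failure details
}

interface ValidationStep {
  name: StepName;
  run: () => Promise<string[]>; // resolves to a list of failure messages
}

// Run each step in order and collect its result.
// Assumption: all steps run even if an earlier one fails; a thrown error
// is recorded as a single failure for that step.
async function validate(steps: ValidationStep[]): Promise<StepResult[]> {
  const results: StepResult[] = [];
  for (const { name, run } of steps) {
    let failures: string[];
    try {
      failures = await run();
    } catch (err) {
      failures = [err instanceof Error ? err.message : String(err)];
    }
    results.push({ step: name, passed: failures.length === 0, failures });
  }
  return results;
}

// Turn failed steps into text that could be passed to the agent as context
// in the next inner-loop iteration. Returns null when every step passed.
function failureContext(results: StepResult[]): string | null {
  const failed = results.filter((r) => !r.passed);
  if (failed.length === 0) return null;
  return failed
    .map((r) => `[${r.step}]\n${r.failures.map((f) => `- ${f}`).join('\n')}`)
    .join('\n\n');
}
```

In this sketch the orchestrator would call `validate` after each agent turn and pass the output of `failureContext` into the next turn's context, so the agent sees only the steps that failed.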
+ +### What This Means for Task Breakdown + +During task breakdown, the agent creates issues for implementation work: + +- "Implement StickyNote card definition" (type: implement) +- "Create sample StickyNote instances" (type: implement) +- "Write QUnit tests for StickyNote" (type: implement) +- "Create Catalog Spec for StickyNote" (type: implement) -### Completion Rule for Test Issues +The agent does **not** need to create "run tests" issues. Test execution happens automatically as part of the validation phase after every inner-loop iteration. -A test-execution issue is not considered done until all tests pass. If tests fail, the agent iterates — but the iteration is modeled as issue state transitions rather than a hard-coded retry loop in the orchestrator. +### Relationship to Phase 1 + +Phase 1 calls this "testing" — the orchestrator runs tests after the agent signals done, feeds failures back, and iterates. Phase 2 generalizes this to a full validation pipeline (parse + lint + evaluate + instantiate + test) and feeds all failures back in the same way. The key evolution is that validation is broader (not just tests) and runs after every agent turn (not just when the agent signals done). The validation is still orchestrator-owned and deterministic — the agent never decides whether to run validation. ## Bootstrap as Part of the Agentic Loop @@ -70,7 +94,7 @@ The flow becomes: 3. The agent picks up this seed issue, reads the brief, and creates: - The Project card - KnowledgeArticle cards - - The initial set of implementation and test issues + - The initial set of implementation issues (card definitions, instances, specs, tests) 4. The agent marks the seed issue as done 5. The orchestrator now has a populated issue backlog and continues the normal loop @@ -83,21 +107,116 @@ This is the "quirk" where an issue's job is to create the project itself. 
But it - The bootstrap process is testable with the same MockFactoryAgent pattern used for implementation issues - Resume works naturally — if the factory crashes during bootstrap, the seed issue is still `in_progress` and gets picked up on restart -## Orchestrator Simplification +## Orchestrator: Issue Loop + Validation -The phase 2 orchestrator becomes much thinner: +The phase 2 orchestrator is a thin scheduler with a built-in validation phase that runs after every agent turn: ``` while (hasUnblockedIssues()) { let issue = pickNextIssue(); - await agent.run(contextForIssue(issue), tools); - // Agent updates issue status/tags directly, then exits. - // Orchestrator reads the issue state to decide what happened. - refreshIssueState(issue); + + // Inner loop: multiple iterations per issue + let validationResults = null; + let iterations = 0; // reset for each issue so maxIterations applies per issue + while (issue.status !== 'done' && issue.status !== 'blocked' && iterations < maxIterations) { + await agent.run(contextForIssue(issue, validationResults), tools); + refreshIssueState(issue); + + // Validation phase — runs after EVERY iteration + validationResults = await validate(targetRealm); // parse, lint, evaluate, instantiate, run tests + // Failures are fed back as context in the next iteration — agent self-corrects + // Agent can also create new issues via tool calls if it decides to + + iterations++; + } } ``` -The agent signals completion by updating the issue — tagging it as blocked, marking it done, etc. The orchestrator doesn't inspect a return value for status; it reads the issue state from the realm after the agent exits. All domain logic (what to implement, how to test, when to create sub-issues, when to tag as blocked) lives in the agent's prompt and skills, not in the orchestrator code. +The agent signals progress by updating the issue — tagging it as blocked, marking it done, or leaving it in progress for another iteration. The orchestrator reads issue state from the realm after each agent turn, then runs validation. 
Validation failures are fed back as context in the next inner-loop iteration so the agent can self-correct. The agent can also create new issues via tool calls if it determines a failure requires separate work. + +All domain logic (what to implement, when to create sub-issues, when to tag as blocked) lives in the agent's prompt and skills. The orchestrator owns only: issue selection, agent invocation, and validation. + +## Schema Refinement: darkfactory.gts + +Phase 1 defined Project and Ticket card types in `darkfactory.gts` with aspirational fields that were never used. Phase 2 trims these to only the fields that are actually set or read in code, and renames Ticket → Issue to match the issue-driven loop language. + +### Project Card — Trimmed + +**Keep** (actively set or read in bootstrap, prompts, skill loader, or tool builder): + +| Field | Type | Used By | +| ----------------------- | ----------------------------- | ------------------------------------------------ | +| `projectCode` | String | Bootstrap, tests, templates | +| `projectName` | String | Bootstrap, prompts, templates | +| `projectStatus` | ProjectStatusField enum | Bootstrap (set to 'active'), templates | +| `objective` | TextAreaField | Bootstrap (from brief summary), prompts | +| `scope` | MarkdownField | Bootstrap (from brief sections), tests | +| `technicalContext` | MarkdownField | Bootstrap, templates | +| `issues` | linksToMany(Issue) with query | Auto-queried, templates (renamed from `tickets`) | +| `knowledgeBase` | linksToMany(KnowledgeArticle) | Bootstrap, skill loader | +| `successCriteria` | MarkdownField | Bootstrap, prompts | +| `testArtifactsRealmUrl` | StringField | Tool builder (test execution) | + +**Drop** (defined but never set or read by factory code): + +| Field | Why Drop | +| ------------ | --------------------------------------------------- | +| `deadline` | Never set or read | +| `teamAgents` | Only in demo fixtures — never read by factory logic | +| `risks` | Never set 
or read | +| `createdAt` | Never set or read on Project (Tickets do use it) | + +### Ticket → Issue Card — Renamed and Trimmed + +Rename `Ticket` to `Issue` throughout. Field renames: `ticketId` → `issueId`, `ticketType` → `issueType`. + +**Keep** (actively set or read): + +| Field | Type | Used By | +| -------------------- | ----------------------------- | ------------------------------------------------------------------ | +| `issueId` | String | Bootstrap, tests, templates (was `ticketId`) | +| `summary` | String | Bootstrap, prompts, templates | +| `description` | MarkdownField | Bootstrap, templates | +| `issueType` | IssueTypeField enum | Bootstrap (set to 'feature'), tests (was `ticketType`) | +| `status` | IssueStatusField enum | Bootstrap, factory-implement.ts (updated post-completion), prompts | +| `priority` | IssuePriorityField enum | Bootstrap, prompts, templates | +| `project` | linksTo(Project) | Bootstrap, skill loader | +| `assignedAgent` | linksTo(AgentProfile) | pick-ticket.ts (assignment workflow) | +| `relatedKnowledge` | linksToMany(KnowledgeArticle) | Skill loader (filters skills by knowledge tags) | +| `acceptanceCriteria` | MarkdownField | Bootstrap, prompts | +| `createdAt` | DateTimeField | Bootstrap (set to context.now) | +| `updatedAt` | DateTimeField | Bootstrap (set to context.now) | + +**Drop** (defined but never set or read): + +| Field | Why Drop | +| ---------------- | ------------------------------------------------------------------------------------ | +| `relatedTickets` | Never set or read (Phase 2 uses `blockedBy`/`predecessors` for dependencies instead) | +| `agentNotes` | Never set or read | +| `estimatedHours` | Never set or read | +| `actualHours` | Never set or read | + +### New Fields for Phase 2 + +The issue-driven loop needs dependency tracking fields not in Phase 1: + +| Field | Type | Purpose | +| ----------- | ------------------ | --------------------------------------------------------------------- | +| 
`blockedBy` | linksToMany(Issue) | Explicit dependency edges — issue can't start until blockers are done | +| `order` | NumberField | Sequence number for tie-breaking when priorities are equal | + +These were described in the "Issue Ordering and Dependencies" section above but need to be added to the Issue card definition. + +### Future: Adopt from Catalog Task Tracker Cards + +The darkfactory Project and Issue definitions are a stopgap — they duplicate fields that should come from the high-quality task tracker cards in the catalog. Longer term, both should `adoptsFrom` the catalog's task tracker card types rather than maintaining their own field definitions. This means: + +- Project adopts from the catalog's Project/Board card (inherits status tracking, team management, etc.) +- Issue adopts from the catalog's Task/Issue card (inherits status workflows, priority, dependencies, etc.) +- darkfactory.gts only adds factory-specific fields (e.g., `testArtifactsRealmUrl`) on top of the inherited base + +This aligns with the catalog-first philosophy: the factory uses the same card types that users create in Boxel, not a parallel schema. It also means improvements to the catalog task tracker (better status workflows, richer dependency modeling) automatically flow into the factory. + +CS-10671 trims and renames the current schema as a first step. The adoption from catalog task tracker cards may happen as part of Phase 2 or as a follow-on — timing TBD. ## Issue Lifecycle @@ -165,6 +284,75 @@ The `LoopAgent` and `runFactoryLoop` signatures don't change — the signal mech ## Boxel-CLI Integration +The boxel-cli integration work is tracked in a dedicated Linear project: **"Incorporate Boxel CLI to Monorepo"**. 
Key tickets include: + +- **CS-10519** — Import boxel-cli into monorepo as `packages/boxel-cli` +- **CS-10520** — Factory as boxel-cli subcommands; migrate realm-operations; retire file I/O tools +- **CS-10642** — boxel-cli owns full auth lifecycle (realm server tokens, per-realm tokens, auto-acquisition) +- **CS-10613** — Skill alignment: deduplicate, establish consistent homes, create `boxel-api` skill +- **CS-10670** — boxel-cli publishes tool definitions for factory consumption (tool delegation) +- **CS-10666** — Create `boxel-api` skill (federated search, realm creation, auth model) +- **CS-10667** — Create `boxel-command` skill (host commands via prerenderer) +- **CS-10593** — Claude Code native LLM support (ClaudeCodeFactoryAgent) +- **CS-10594** — Codex CLI native support + +### Architectural Principle: boxel-cli Owns the Entire Boxel API Surface + +**Any code that makes an HTTP call to the realm server or Matrix API must live in boxel-cli.** The software factory never calls realm APIs directly — it imports from boxel-cli. This is not a convenience; it is a hard boundary. + +This means: + +- `realm-operations.ts` (20 functions wrapping realm HTTP endpoints) → migrates to boxel-cli +- Auth helpers (`realm-auth.ts`, `boxel.ts` Matrix/OpenID flows) → migrate to boxel-cli +- Skills that teach realm API concepts (search queries, federated endpoints, auth model) → live with boxel-cli +- The factory keeps only orchestration logic: the ralph loop, test execution orchestration, bootstrap flow, and issue scheduling + +The factory becomes a pure consumer of boxel-cli's API layer. It calls `boxel sync`, `boxel pull`, `boxel create`, or imports boxel-cli's programmatic API — it never constructs HTTP requests to realm endpoints. + +### What Migrates from `realm-operations.ts` to boxel-cli + +The `realm-operations.ts` module was designed as a centralized, self-contained set of realm API wrappers with no factory-specific logic. 
It migrates wholesale: + +| Function | Endpoint | boxel-cli Home | +| ------------------------------- | ---------------------------- | ------------------------------------------------------------- | +| `searchRealm()` | `QUERY /_search` | Evolves into federated search via `/_federated-search` | +| `readFile()` | `GET /` | Absorbed by `boxel pull` / programmatic read API | +| `writeFile()` | `POST /` | Absorbed by `boxel sync` / programmatic write API | +| `deleteFile()` | `DELETE /` | Absorbed by `boxel sync --prefer-local` with deletions | +| `atomicOperation()` | `POST /_atomic` | Already implemented in boxel-cli's batch upload | +| `runRealmCommand()` | `POST /_run-command` | New `boxel command` subcommand (CS-10416) | +| `createRealm()` | `POST /_create-realm` | New `boxel create-realm` subcommand | +| `getServerSession()` | `POST /_server-session` | Part of boxel-cli's auth layer | +| `getRealmScopedAuth()` | `POST /_realm-auth` | Part of boxel-cli's auth layer | +| `cancelAllIndexingJobs()` | `POST /_cancel-indexing-job` | New boxel-cli API | +| `waitForRealmReady()` | `GET /_readiness-check` | New boxel-cli API | +| `waitForRealmFile()` | `GET /` (polling) | New boxel-cli API | +| `pullRealmFiles()` | `GET /_mtimes` + files | Already `boxel pull` (auth managed by boxel-cli per CS-10642) | +| `addRealmToMatrixAccountData()` | Matrix account data API | Part of boxel-cli's auth/profile layer | + +Auth helpers in `realm-auth.ts` and `boxel.ts` (Matrix login, OpenID token, realm server token, per-realm JWTs) also migrate to boxel-cli's auth layer. + +After migration, `realm-operations.ts` is deleted. Direct `fetch()` calls to realm endpoints in `factory-bootstrap.ts` and `factory-target-realm.ts` are replaced with boxel-cli imports. + +### Search Evolves to Federated + +The current `searchRealm()` targets a single specified realm. 
In boxel-cli, this evolves into a federated search backed by the realm server's `/_federated-search` endpoint, which searches across **all realms the user has access to** using `multiRealmAuthorization`. + +The initial implementation uses `/_federated-search` only. The realm server also exposes `/_federated-search-prerendered`, `/_federated-types`, and `/_federated-info`, but these are not in scope for the initial integration. + +For the locally synced target realm, the LLM uses native grep/find — no API call needed. Federated search is for querying **remote** realms (catalog, base realm, other users' realms). + +### Skill Placement Follows the API Boundary + +Since boxel-cli owns the Boxel API surface, skills that teach realm API concepts live with boxel-cli: + +- **`boxel-api`** (new skill) — search query syntax, federated endpoints, realm creation, auth model. Lives at `packages/boxel-cli/.agents/skills/` +- **CLI command skills** (`boxel-sync`, `boxel-track`, etc.) — already CLI-specific. Live at `packages/boxel-cli/.agents/skills/` +- **Card domain knowledge** (`boxel-development`, `boxel-file-structure`) — not API-specific, applies to anyone working with cards. Lives at root `.agents/skills/` +- **Factory orchestration** (`software-factory-operations`) — ralph loop, factory tools. Lives at `packages/software-factory/.agents/skills/` + +### Background + Phase 1 uses HTTP API calls (`realm-operations.ts`) as the primary realm I/O path. Boxel-cli exists and has profile-based auth, but its auth model isn't flexible enough for the factory's needs — specifically, obtaining auth tokens for newly created realms on the fly. Boxel-cli also lives in a separate repository (`cardstack/boxel-cli`), making it difficult to evolve in lockstep with factory requirements. Phase 2 solves both problems: integrate boxel-cli into the monorepo as a first-class package, extend its auth model to handle dynamically created realms, and use it as the primary realm I/O layer. 
@@ -194,18 +382,29 @@ Being in the monorepo means:
 - Boxel-cli gets the same CI rigor as other packages: linting, type-checking, thorough test coverage
 - Shared types and utilities can be extracted to `runtime-common` instead of being duplicated

-### Flexible Auth Support (CS-10529)
+### boxel-cli Owns the Full Auth Lifecycle (CS-10642)
+
+Boxel-cli already has profile-based auth — users log in via `boxel profile add`, and the CLI uses stored credentials to authenticate with realm servers. But the factory creates new realms on the fly and immediately needs to read/write to them. Profile-based auth only knows about realms the user has manually configured.
+
+The principle that boxel-cli owns the entire Boxel API surface extends to auth. The factory should never touch a JWT directly — boxel-cli manages the full token lifecycle internally:
+
+1. **Two-tier token model** — boxel-cli understands both realm server tokens (obtained via Matrix OpenID → `POST /_server-session`, grants server-level access) and per-realm tokens (obtained via `POST /_realm-auth`, grants access to specific realms). Both are cached and refreshed automatically.

-Boxel-cli already has profile-based auth — users log in via `boxel profile add`, and the CLI uses stored credentials to authenticate with realm servers. But the factory creates new realms on the fly and immediately needs to read/write to them. Profile-based auth only knows about realms the user has manually configured; it has no way to obtain tokens for a realm that was just created seconds ago.
+2. **Automatic token acquisition on realm creation** — When `boxel create-realm` creates a new realm, boxel-cli automatically waits for readiness, obtains the per-realm JWT, and stores it in its auth state. Subsequent `boxel pull`/`boxel sync` on that realm Just Work — no `--jwt` flag, no token passing.

-CS-10529 extends boxel-cli's auth model to handle this:
+3. **Programmatic auth API** — Export a `BoxelAuth` class (or similar) so the factory imports it and never constructs HTTP requests or manages tokens:

-1. **Dynamic realm token acquisition** — When boxel-cli authenticates with a realm server (via the existing profile-based flow), it already has a realm server token. After creating a new realm, boxel-cli should automatically obtain and store the per-realm JWT for that realm in its auth state. This means `boxel create` followed by `boxel sync` on the new realm should Just Work — no manual token passing needed.
-2. **Realm server token awareness** — Since boxel-cli authenticates with a realm server as part of its profile flow, the realm server URL and token are already known. Commands like `boxel create` should use this existing auth context rather than requiring the realm server URL or token as explicit CLI arguments.
-3. **Programmatic auth API** — Export auth helpers from boxel-cli so the factory can call sync/push/pull programmatically with the CLI's auth context, without spawning a subprocess.
-4. **Token refresh callback** — Allow callers to provide a function that refreshes expired JWTs, so long-running sync operations don't fail mid-stream.
+   ```typescript
+   import { BoxelAuth } from '@cardstack/boxel-cli';
+   const auth = new BoxelAuth(credentials);
+   await auth.createRealm({ name, owner }); // token auto-acquired
+   await auth.pull(realmUrl, workspaceDir); // uses stored token
+   await auth.sync(workspaceDir, { preferLocal: true });
+   ```

-The key insight is that realm creation and subsequent realm I/O should be a seamless flow within boxel-cli's existing auth model. The factory shouldn't need to manually juggle JWTs — boxel-cli's auth state should absorb newly created realms automatically.
+4. **Token refresh for long-running operations** — The factory loop runs for hours. boxel-cli's `RealmAuthClient` already has token refresh with 60s lead time — this extends to cover all realm operations so long-running sessions don't fail mid-stream.
+
+After this, the factory deletes `realm-auth.ts`, auth portions of `boxel.ts`, and all `authorization`/`serverToken`/`realmTokens` fields threaded through its config types.

 ### Realm Creation via Boxel-CLI

@@ -242,13 +441,17 @@ This means:
 - **`realm-read`, `realm-write`, `realm-delete`** remain available for operations that must happen immediately on the live realm (e.g., updating a ticket status that another process is watching), but they are no longer the primary I/O path.
 - **`realm-atomic`** remains for transactional multi-file operations where partial failure is unacceptable.

-#### What Stays as Realm API Tools
+#### What Stays as Factory Tools (Backed by boxel-cli)
+
+Some operations are inherently server-side and cannot be replaced by local file I/O. These remain as factory tools but are backed by boxel-cli imports — no direct HTTP calls from the factory:

-Some operations are inherently server-side and cannot be replaced by local file I/O:
+- **`search_realms`** — federated search across all accessible realms via boxel-cli wrapping `/_federated-search`
+- **`run_command`** — host commands via prerenderer, backed by boxel-cli wrapping `/_run-command`
+- **`run_tests`** — Playwright orchestration (factory-specific, uses boxel-cli for file pulls)
+- **`signal_done`** / **`request_clarification`** — control flow signals back to the ralph loop (factory-only, no API call)
+- **`realm-create`** — backed by boxel-cli's `BoxelAuth.createRealm()` with auto token acquisition (CS-10642)

-- **`realm-search`** — structured queries against the realm index (type filters, field queries, sorting)
-- **`realm-server-session`** / **`realm-auth`** — JWT management (may be absorbed into boxel-cli's auth layer)
-- **`pick-ticket`** — ticket queries that filter by status, priority, agent
+Auth tools (`realm-server-session`, `realm-auth`) are fully absorbed into boxel-cli's auth layer per CS-10642 — the factory never manages tokens.

 #### Tool Registry Changes

@@ -262,37 +465,115 @@ let allManifests = [...SCRIPT_TOOLS, ...BOXEL_CLI_TOOLS, ...REALM_API_TOOLS];

 `BOXEL_CLI_TOOLS` (`boxel-sync`, `boxel-push`, `boxel-pull`, `boxel-status`, `boxel-create`, `boxel-history`) become available to the agent. The factory-level wrapper tools (`write_file`, `read_file`, `search_realm`) can be retired or kept as convenience aliases that delegate to the filesystem + sync.

-#### Skill Re-enablement
+#### Skill Re-enablement and Alignment (CS-10613)
+
+The 6 CLI skills excluded in phase 1 (`boxel-sync`, `boxel-track`, `boxel-watch`, `boxel-restore`, `boxel-repair`, `boxel-setup`) are re-enabled in the skill resolver. The `CLI_ONLY_SKILLS` exclusion list in `factory-skill-loader.ts` is removed.

-The 6 CLI skills excluded in phase 1 (`boxel-sync`, `boxel-track`, `boxel-watch`, `boxel-restore`, `boxel-repair`, `boxel-setup`) are re-enabled in the skill resolver. The `CLI_ONLY_SKILLS` exclusion list in `factory-skill-loader.ts` is removed, and the keyword-based CLI skill resolution logic is restored.
+Beyond re-enablement, CS-10613 performs a full skill alignment:
+
+- **Deduplication** — 8 of 9 factory skills are identical copies in boxel-cli. Each skill gets a single source of truth.
+- **Consistent homes** — Skills are placed based on what they teach:
+  - CLI commands + realm API → `packages/boxel-cli/.agents/skills/` (boxel-sync, boxel-track, boxel-watch, boxel-repair, boxel-restore, boxel-setup, **boxel-api** NEW)
+  - Card domain knowledge → root `.agents/skills/` (boxel-development, boxel-file-structure)
+  - Factory orchestration → `packages/software-factory/.agents/skills/` (software-factory-operations)
+- **New `boxel-api` skill** — Consolidates scattered realm API knowledge (search queries, federated endpoints, auth model, realm creation) into a canonical reference at boxel-cli. This fills the current gap where no skill covers federated endpoints, realm creation, or auth flows.
+- **Skill content rewrite** — All skills updated to remove references to retired HTTP tools (`write_file`, `read_file`, `search_realm`). Skills teach Boxel-specific domain knowledge only — not how to read/write files (the LLM already knows).
+- **Loader updates** — Factory's custom skill loader updated with fallback dirs: primary (software-factory) → fallback 1 (boxel-cli) → fallback 2 (root). Both Claude Code's native loader and the factory's programmatic loader read from the same skill files via symlinks.

 ### Migration Strategy

 The refactor happens in stages to avoid a big-bang rewrite:

 1. **Stage 1: Monorepo import** — Move boxel-cli into `packages/boxel-cli`. Set up CI (linting, type-checking, tests). All existing factory code continues to use HTTP-based realm operations unchanged.
-2. **Stage 2: Auth extension (CS-10529)** — Extend boxel-cli auth to automatically acquire and store tokens for newly created realms. Add programmatic auth API. Factory tests verify that `boxel create` followed by `boxel sync` works seamlessly for factory-created realms.
+2. **Stage 2: Auth extension (CS-10642)** — Extend boxel-cli auth to automatically acquire and store tokens for newly created realms. Add programmatic auth API. Factory tests verify that `boxel create` followed by `boxel sync` works seamlessly for factory-created realms.
 3. **Stage 3: Sync-based workspace** — Factory entrypoint syncs the target realm to a local workspace before starting the agent loop. Agent writes files locally. A post-iteration sync pushes changes to the realm.
 4. **Stage 4: Retire HTTP wrappers** — Remove `realm-operations.ts` stopgap functions (`writeModuleSource`, `readCardSource`, `writeCardSource`, `pullRealmFiles`). Replace with boxel-cli calls. Keep `searchRealm` for structured queries.
 5. **Stage 5: Re-enable CLI skills** — Remove the `CLI_ONLY_SKILLS` filter from the skill resolver. Update CLI skill content for the factory agent context.

+### Tool Delegation: boxel-cli Publishes Tool Definitions (CS-10670)
+
+`factory-tool-builder.ts` currently hardcodes every tool's name, description, JSON schema parameters, and execute function (~14 tool definitions). When tools migrate to boxel-cli, the factory shouldn't have to maintain definitions for tools it doesn't own — that creates a coupling problem where parameter changes in boxel-cli require matching updates in the factory.
+
+The fix: **boxel-cli publishes its own tool surface** and the factory consumes it via delegation.
+
+boxel-cli exports a function that returns tool definitions:
+
+```typescript
+// In @cardstack/boxel-cli
+export function getToolDefinitions(auth: BoxelAuth): BoxelToolDefinition[] {
+  return [
+    {
+      name: 'search_realms',
+      description: 'Federated search across all accessible realms',
+      parameters: {
+        /* JSON Schema */
+      },
+      execute: async (params) => auth.federatedSearch(params.query),
+    },
+    {
+      name: 'run_command',
+      description: 'Execute a host command via the prerenderer',
+      parameters: {
+        /* JSON Schema */
+      },
+      execute: async (params) => auth.runCommand(params.command, params.input),
+    },
+    // ... all boxel-cli tools, each with schema + implementation
+  ];
+}
+```
+
+The factory tool builder becomes a thin composition layer:
+
+```typescript
+// In software-factory
+import { getToolDefinitions, type BoxelAuth } from '@cardstack/boxel-cli';
+
+function buildTools(auth: BoxelAuth): FactoryTool[] {
+  const cliTools = getToolDefinitions(auth); // delegated — boxel-cli owns these
+  const factoryTools = [
+    { name: 'signal_done', ... }, // factory-only
+    { name: 'request_clarification', ... }, // factory-only
+    { name: 'run_tests', ... }, // factory-specific Playwright orchestration
+  ];
+  return [...cliTools, ...factoryTools];
+}
+```
+
+This means:
+
+- **Single source of truth** — boxel-cli owns the name, description, schema, and implementation for its tools
+- **Factory tool builder shrinks** — from ~14 manually defined tools to 3-4 factory-specific ones
+- **No coupling** — adding or changing a boxel-cli tool automatically reflects in the factory with zero factory code changes
+- **Skill alignment** — the `boxel-api` skill (CS-10666) and tool definitions are co-located in boxel-cli, so they stay in sync
+
+#### Future: boxel-cli as MCP Server
+
+A natural evolution is for boxel-cli to expose its tools as an **MCP (Model Context Protocol) server**. This would allow Claude Code, Codex CLI, or any MCP-compatible agent to discover and call boxel-cli tools directly — without the factory as intermediary.
+
+In this model:
+
+- boxel-cli runs `boxel mcp-server` (or is configured as an MCP server in `.claude/settings.json`)
+- Claude Code connects and discovers all available tools: `search_realms`, `create_realm`, `run_command`, `sync`, `pull`, `push`, etc.
+- The ralph loop can also connect as an MCP client when invoking the agent, so the agent gets boxel-cli tools alongside factory tools
+- Tool definitions, schemas, and descriptions are served dynamically — always up to date
+
+This ties into CS-10418 (realms exposing MCP servers) and creates a consistent tool discovery pattern across the Boxel ecosystem. The programmatic manifest (the `getToolDefinitions()` export described above) is the right first step because it's simpler and works today. MCP is the path once the protocol stabilizes and tool discovery becomes the standard for agent runtimes.
+
 ### Impact on `factory-tool-builder.ts`

-The factory-level tools evolve:
-
-| Phase 1 Tool            | Phase 2 Replacement                      | Notes                                                    |
-| ----------------------- | ---------------------------------------- | -------------------------------------------------------- |
-| `write_file`            | Filesystem write + `boxel sync`          | Agent writes to local workspace                          |
-| `read_file`             | Filesystem read (`cat`)                  | Agent reads from local workspace                         |
-| `search_realm`          | `grep`/`find` + `realm-search`           | Local search for files, realm API for structured queries |
-| `update_ticket`         | Filesystem write + `boxel sync`          | Or keep as realm-api tool for immediate server update    |
-| `update_project`        | Filesystem write + `boxel sync`          | Or keep as realm-api tool for immediate server update    |
-| `create_knowledge`      | Filesystem write + `boxel sync`          | Agent writes JSON to local workspace                     |
-| `run_tests`             | Playwright runs against local spec files | No need to pull from realm first                         |
-| `signal_done`           | Agent updates issue status directly      | Signals via issue state, not return type                 |
-| `request_clarification` | Agent tags issue as blocked              | Signals via issue state                                  |
-
-Tools like `update_ticket` and `update_project` may be kept as convenience tools that write directly to the realm API for status updates that need to be immediately visible — but the bulk of file I/O moves to the filesystem.
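As a complement to the `buildTools` composition shown earlier, the merge step could be sketched with an explicit guard against name collisions between delegated and factory-owned tools. This is only an illustration: the `ToolDefinition` shape, `composeTools`, and `stubTool` are hypothetical names, and the collision check is an assumed design choice rather than something the factory or boxel-cli is known to implement.

```typescript
// Hypothetical tool shape, loosely modeled on the definitions shown earlier.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema object
  execute: (params: Record<string, unknown>) => Promise<unknown>;
}

// Merge delegated CLI tools with factory-owned tools, failing fast if
// both sides define the same tool name.
function composeTools(
  cliTools: ToolDefinition[],
  factoryTools: ToolDefinition[],
): ToolDefinition[] {
  const seen = new Set<string>();
  const merged: ToolDefinition[] = [];
  for (const tool of [...cliTools, ...factoryTools]) {
    if (seen.has(tool.name)) {
      throw new Error(`Duplicate tool name: ${tool.name}`);
    }
    seen.add(tool.name);
    merged.push(tool);
  }
  return merged;
}

// Stub definitions for illustration only; execute does no real work.
function stubTool(name: string): ToolDefinition {
  return {
    name,
    description: `stub for ${name}`,
    parameters: { type: 'object' },
    execute: async () => null,
  };
}

const tools = composeTools(
  [stubTool('search_realms'), stubTool('run_command'), stubTool('create_realm')],
  [stubTool('run_tests'), stubTool('signal_done'), stubTool('request_clarification')],
);
console.log(tools.map((t) => t.name).join(', '));
```

Failing fast on a duplicate name is one way to keep ownership unambiguous: if boxel-cli later ships a tool that the factory also defines, the conflict surfaces when the tool list is built instead of one definition silently shadowing the other.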
+With tool delegation, the factory only manually defines tools it uniquely owns:
+
+| Tool                    | Owner            | How it's defined                                   |
+| ----------------------- | ---------------- | -------------------------------------------------- |
+| `search_realms`         | boxel-cli        | Delegated via `getToolDefinitions()`               |
+| `run_command`           | boxel-cli        | Delegated via `getToolDefinitions()`               |
+| `create_realm`          | boxel-cli        | Delegated via `getToolDefinitions()`               |
+| `run_tests`             | software-factory | Manual — factory-specific Playwright orchestration |
+| `signal_done`           | software-factory | Manual — control flow signal to ralph loop         |
+| `request_clarification` | software-factory | Manual — control flow signal to ralph loop         |
+
+All retired tools (`write_file`, `read_file`, `search_realm`, `update_ticket`, `update_project`, `create_knowledge`, `create_catalog_spec`, `realm-read`, `realm-write`, `realm-delete`) are gone — replaced by native LLM file I/O + `boxel sync`.

 ## Open Questions