Phase D — migrate aegis-oss to @stackbilt/llm-providers (remove all bolted-in LLM logic) #24

@stackbilt-admin

Summary

Multi-session epic to remove all bolted-in LLM inference logic from aegis-oss and the downstream AEGIS daemon, and to consume `@stackbilt/llm-providers` as the canonical routing layer instead. The architectural rule behind it: no Stackbilt repo (public or private) may contain its own bolted-in LLM inference, routing, failover, or provider-specific orchestration logic. llm-providers is the single source of truth (SoT).

This is a four-session epic. aegis-oss is the contract repo per `project_dependency_model.md` — migration lands here first, then the daemon inherits via `@stackbilt/aegis-core`.

Current migration state (as of 2026-04-10, daemon v1.96.2)

Already on llm-providers (done):

  • `kernel/executors/cerebras.ts` — uses `CerebrasProvider`
  • `kernel/executors/groq.ts` — uses `GroqProvider` (both plain and tool-use variants)
  • `kernel/resilience.ts` — re-exports `CircuitBreakerManager`, `CostTracker`, `CreditLedger`, `ExhaustionRegistry` from llm-providers

Not yet migrated (the real Phase D scope):

  1. Anthropic — `web/src/claude.ts` / `executeClaudeChat` still uses raw Anthropic SDK. `AnthropicProvider` exists in llm-providers.
  2. Workers AI / GPT-OSS — `executeGptOss` / `executeWorkersAi` use raw `env.ai?.run()`. `CloudflareProvider` exists in llm-providers.
  3. Dispatch routing policy — the daemon downstream has a Cerebras remap (`if plan.executor === 'claude' → cerebras_mid`) that intercepts semantic executor names inside the dispatch switch. This is policy, not LLM logic, and should become a thin routing adapter above the llm-providers factory instead of an intercept inside the executor switch.
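One way to picture item 3 is a policy step that rewrites the plan before it reaches the executor switch, so the switch itself never special-cases `claude`. The sketch below is only an illustration under assumptions: `DispatchPlan`, `RoutingPolicy`, `remapPolicy`, and `applyPolicies` are hypothetical names, not existing aegis-oss or llm-providers APIs, and `groq` is used as a stand-in executor name.

```typescript
// Hypothetical shapes: none of these names come from aegis-oss or llm-providers.
interface DispatchPlan {
  executor: string;
}

// A routing policy is a pure function from one plan to another.
type RoutingPolicy = (plan: DispatchPlan) => DispatchPlan;

// Builds a policy from a static remap table, e.g. { claude: "cerebras_mid" }.
function remapPolicy(table: Readonly<Record<string, string>>): RoutingPolicy {
  return (plan) => {
    const target = Object.prototype.hasOwnProperty.call(table, plan.executor)
      ? table[plan.executor]
      : undefined;
    // Plans with no entry in the table pass through unchanged.
    return target === undefined ? plan : { ...plan, executor: target };
  };
}

// Applies policies in order, before the plan is handed to the dispatcher.
function applyPolicies(
  plan: DispatchPlan,
  policies: readonly RoutingPolicy[],
): DispatchPlan {
  return policies.reduce((current, policy) => policy(current), plan);
}

// The daemon's current remap, expressed as data instead of a branch in the switch.
const daemonPolicy = remapPolicy({ claude: "cerebras_mid" });
const routed = applyPolicies({ executor: "claude" }, [daemonPolicy]);
const untouched = applyPolicies({ executor: "groq" }, [daemonPolicy]);
```

Keeping the remap as a table means the daemon could supply it as configuration, while aegis-oss ships with an empty policy list.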

Architectural concern: executor naming abstraction

Current dispatch uses semantic executor names (`claude_opus`, `cerebras_reasoning`, `cerebras_mid`, `claude_code`) that encode strategy + tier + capability. llm-providers routes by provider + model + fallback chain. Phase D needs a lightweight routing policy adapter in aegis-oss that maps semantic names → (provider, model, fallback chain) tuples. This adapter becomes part of the canonical contract; the daemon inherits it and can optionally supply custom presets for daemon-specific strategies.

This is the design work that makes Phase D non-mechanical.
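A minimal sketch of what such an adapter could look like, assuming a static lookup table is enough. The type names (`RouteTarget`, `Route`), the functions `resolveRoute` and `attemptOrder`, the provider strings, and every model identifier are placeholders invented for illustration; the real llm-providers types and model names may differ.

```typescript
// All names below are illustrative; the real llm-providers types may differ.
interface RouteTarget {
  provider: string;
  model: string;
}

interface Route extends RouteTarget {
  // Tried in order if the primary target fails.
  fallbacks: readonly RouteTarget[];
}

type RouteTable = Readonly<Record<string, Route>>;

// Placeholder defaults. The model identifiers are made up for this example.
const DEFAULT_ROUTES: RouteTable = {
  claude_opus: {
    provider: "anthropic",
    model: "example-opus-model",
    fallbacks: [{ provider: "cerebras", model: "example-reasoning-model" }],
  },
  cerebras_mid: {
    provider: "cerebras",
    model: "example-mid-model",
    fallbacks: [{ provider: "groq", model: "example-groq-model" }],
  },
};

// Maps a semantic executor name to its (provider, model, fallback chain) tuple.
function resolveRoute(name: string, table: RouteTable = DEFAULT_ROUTES): Route {
  const route = Object.prototype.hasOwnProperty.call(table, name)
    ? table[name]
    : undefined;
  if (route === undefined) {
    throw new Error(`Unknown executor name: ${name}`);
  }
  return route;
}

// Flattens a route into the ordered list of targets to attempt.
function attemptOrder(route: Route): RouteTarget[] {
  return [{ provider: route.provider, model: route.model }, ...route.fallbacks];
}

const attempts = attemptOrder(resolveRoute("claude_opus"));
```

Throwing on an unknown name, rather than silently falling back to a default, would surface typos in semantic names at dispatch time; whether that is the right behavior is a design decision for Session D.1.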

Dependencies on llm-providers

Dependencies on edge-auth

  • `Stackbilt-dev/edge-auth#82` — canonical `ResourceQuotaProvider` contract. Independent of Phase D execution, but the QuotaHook wiring in Session D.3+ will consume it.

Plan

Session D.1 — aegis-oss: routing adapter + Anthropic migration

  • Design the executor routing adapter (semantic names → provider/model/fallback tuples)
  • Port `executeClaudeChat` to use `AnthropicProvider` from llm-providers
  • Preserve: MCP integration, streaming (gated on llm-providers#26), cost tracking, circuit breakers
  • Update canonical dispatch tests to pass against the new adapter
  • Publish `@stackbilt/aegis-core` with the migrated code

Session D.2 — aegis-oss: Workers AI migration

  • Port `executeGptOss` / `executeWorkersAi` to `CloudflareProvider`
  • Retire raw `env.ai?.run()` usage
  • Update tests
  • Publish new `@stackbilt/aegis-core` version

Session D.3 — daemon (Stackbilt-dev/aegis): inherit via dependency model

  • Remove daemon's Cerebras remap intercept
  • Delete daemon's `web/src/claude.ts` and `web/src/kernel/executors/workers-ai.ts`
  • Keep daemon-specific Cerebras tier presets if they differ from aegis-oss defaults (inject as custom presets on the factory)
  • Restore canonical dispatch tests — they should start passing once the bolted-in logic is gone
  • Bump daemon to consume the new `@stackbilt/aegis-core`
  • Deploy
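The "keep presets only if they differ" step above can be sketched as a diff followed by a merge. This assumes presets are plain provider/model records keyed by executor name; `Preset`, `differingPresets`, `withCustomPresets`, and the model identifiers are hypothetical, and how the factory actually accepts custom presets is not specified in this issue.

```typescript
// Hypothetical preset shape; the real factory input may look different.
interface Preset {
  provider: string;
  model: string;
}

type PresetTable = Readonly<Record<string, Preset>>;

// Returns only the candidate presets that are new or differ from the defaults.
function differingPresets(defaults: PresetTable, candidate: PresetTable): PresetTable {
  const result: Record<string, Preset> = {};
  for (const [name, preset] of Object.entries(candidate)) {
    const base = Object.prototype.hasOwnProperty.call(defaults, name)
      ? defaults[name]
      : undefined;
    if (
      base === undefined ||
      base.provider !== preset.provider ||
      base.model !== preset.model
    ) {
      result[name] = preset;
    }
  }
  return result;
}

// Overrides win over defaults; untouched defaults are inherited as-is.
function withCustomPresets(defaults: PresetTable, overrides: PresetTable): PresetTable {
  return { ...defaults, ...overrides };
}

// Placeholder data: the model identifiers are invented for the example.
const ossDefaults: PresetTable = {
  cerebras_mid: { provider: "cerebras", model: "example-mid-model" },
  cerebras_reasoning: { provider: "cerebras", model: "example-reasoning-model" },
};
const daemonPresets: PresetTable = {
  cerebras_mid: { provider: "cerebras", model: "example-mid-model" },
  cerebras_reasoning: { provider: "cerebras", model: "example-daemon-model" },
};

const custom = differingPresets(ossDefaults, daemonPresets);
const effective = withCustomPresets(ossDefaults, custom);
```

Here only `cerebras_reasoning` survives the diff, so the daemon would carry a single override and inherit everything else from aegis-oss.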

Session D.4 — validation + policy enforcement

  • Integration test end-to-end: chat streaming, tool-use, failover scenarios
  • Grep across aegis-oss, the daemon, foodfiles, and img-forge for direct `@anthropic-ai/sdk` or `groq-sdk` imports and raw `env.ai?.run()` calls; delete any that remain
  • Add a lint rule (or CI check) that fails on raw provider SDK imports outside of `@stackbilt/llm-providers`
  • Close this epic
  • Close corresponding daemon kernel shadow (`dispatch.ts +366L` in `project_daemon_kernel_shadow`)
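If the CI check route is taken, a plain text scan may be enough to start with. The sketch below is a heuristic under assumptions, not a real lint rule: `findViolations` and `Violation` are hypothetical names, the scan is line-based, and it will also flag matches inside comments or strings. File discovery and the non-zero exit code are left to the CI wrapper.

```typescript
// Packages that consumer repos should not import directly (taken from the list above).
const FORBIDDEN_PACKAGES: readonly string[] = ["@anthropic-ai/sdk", "groq-sdk"];

interface Violation {
  file: string;
  line: number; // 1-based line number
  reason: string;
}

// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Scans one file's source text for raw SDK imports and raw env.ai.run() calls.
function findViolations(file: string, source: string): Violation[] {
  const violations: Violation[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    for (const pkg of FORBIDDEN_PACKAGES) {
      // Matches `from "pkg"`, `import "pkg"`, `require("pkg")`, `import("pkg")`, and subpaths.
      const pattern = new RegExp(
        `(?:from|import|require)\\s*\\(?\\s*["']${escapeRegExp(pkg)}(?:/[^"']*)?["']`,
      );
      if (pattern.test(text)) {
        violations.push({ file, line: index + 1, reason: `raw import of ${pkg}` });
      }
    }
    // Matches both `env.ai.run(` and `env.ai?.run(`.
    if (/\benv\.ai\??\.run\s*\(/.test(text)) {
      violations.push({ file, line: index + 1, reason: "raw env.ai.run() call" });
    }
  });
  return violations;
}

const exampleSource = [
  'import Anthropic from "@anthropic-ai/sdk";',
  'import { something } from "@stackbilt/llm-providers";',
  'const out = await env.ai?.run("model", {});',
].join("\n");
const exampleViolations = findViolations("example.ts", exampleSource);
```

A CI job could run this over every source file outside `@stackbilt/llm-providers` and fail when the combined list is non-empty. An AST-based lint rule would be more precise, at the cost of more setup.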

Related daemon work

  • 1.96.1 (2026-04-10) — Phase A kernel shadow cleanup. Closed 4 of ~10 shadows. Phase D.3 closes the dispatch.ts shadow as a side effect.
  • 1.96.2 (2026-04-10) — AI Gateway account-ID shadow collapse (5th shadow). Fixed a latent bug where the wrong CF account ID was hardcoded; now pulls from `env.CF_ACCOUNT_ID` inherited via Phase A.

Internal memory references (AEGIS context)

  • `project_phase_d_llm_providers.md` — full scoping with gap analysis and 4-session breakdown
  • `project_resource_quota_seam.md` — fractal multi-tenant quota architecture
  • `project_daemon_kernel_shadow.md` — broader daemon→aegis-oss shadow cleanup context
  • `feedback_no_bolted_llm_logic.md` — the architectural rule triggering this work

Priority

Design-heavy, multi-session. Not a cc-taskrunner candidate — this needs dedicated Claude Code sessions for each phase. Track here, execute as sessions become available.

🤖 Filed by AEGIS during Phase D scoping session
