Problem
When working with very large codebases (e.g., PyTorch, LLM kernels, or complex distributed systems), a single LLM agent often hits a "context wall." Even with 128k+ token windows, the "Lost in the Middle" phenomenon and the quadratic cost of attention make it impractical to maintain a high-fidelity mental model of the entire repository.
Users need a way to perform cross-module refactoring or feature implementation without manually feeding files into the context or worrying about context overflow and performance degradation.
Proposed change
I propose an SSD (Scalable Sovereign Domains) Multi-Agent architecture. This shifts the paradigm from "one agent fits all" to a structured, hierarchical agent ecosystem within the whale TUI.
Core Mechanisms:
Prefix Sharing (KV Cache Optimization): Use a unified context prefix for macro-level system information (high-level architecture, global constants). Agents branch out from this prefix to handle micro-tasks, so the shared prefix is prefilled once instead of being recomputed for every agent.
Sovereign Domain Mapping: A registry system where each "Expert Agent" is assigned a specific subdirectory or functional module. Agents use a router to identify which expert to "consult" when encountering foreign API calls.
The "Expert-Developer" Separation:
Expert Agents: Long-lived, keeping roughly 50% of their context window filled with module-specific knowledge. Their KV cache state can be offloaded to the server's disk/SSD and swapped back in when the expert is queried, so a cache hit avoids a full re-prefill.
Developer Agents: Ephemeral agents created for specific tasks. They start with a lean context (about 10% full), grow as they work, and hand the task over to a successor once usage reaches 80% of the context limit (the "cognitive load" threshold) to maintain peak reasoning performance.
Automatic Expert Evolution: Upon task completion, a "Janitor Agent" scans the affected areas to update the Expert Agents' knowledge base, performing context compression or "Expert Splitting" (splitting one module expert into two) if the domain becomes too complex.
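To make the domain mapping and the 80% handoff rule above more concrete, here is a minimal sketch. All names (ExpertAgent, DomainRegistry, DeveloperAgent, route, should_hand_over) are hypothetical and not part of any existing whale API. The routing rule (longest matching path prefix wins) is an assumption; the proposal does not specify how the router picks an expert.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class ExpertAgent:
    """Hypothetical long-lived expert that owns one subdirectory."""
    name: str
    domain: str  # repository-relative directory this expert is responsible for


@dataclass
class DomainRegistry:
    """Maps repository paths to the expert whose domain contains them."""
    experts: list[ExpertAgent] = field(default_factory=list)

    def register(self, expert: ExpertAgent) -> None:
        self.experts.append(expert)

    def route(self, path: str) -> ExpertAgent | None:
        """Return the expert with the deepest domain containing `path`.

        Assumption: the most specific (longest) prefix wins.
        Returns None when no expert owns the path.
        """
        target = PurePosixPath(path)
        best: ExpertAgent | None = None
        best_depth = -1
        for expert in self.experts:
            domain = PurePosixPath(expert.domain)
            if target == domain or domain in target.parents:
                depth = len(domain.parts)
                if depth > best_depth:
                    best, best_depth = expert, depth
        return best


@dataclass
class DeveloperAgent:
    """Hypothetical ephemeral agent that hands over near its context limit."""
    context_limit: int          # total tokens the model can hold
    used_tokens: int = 0        # tokens consumed so far
    handoff_ratio: float = 0.8  # the 80% threshold from the proposal

    def add_tokens(self, n: int) -> None:
        self.used_tokens += n

    def should_hand_over(self) -> bool:
        """True once usage reaches the handoff threshold."""
        return self.used_tokens >= self.handoff_ratio * self.context_limit
```

A developer agent that meets a call into torch/csrc/autograd/ would ask the registry for the owning expert, and the orchestrator would check should_hand_over() after each step to decide when to spawn a successor.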
Alternatives considered
Naive RAG: Often loses the structural logic and deep dependency chains of complex C++/Python interops.
Long Context Windows: High latency, accuracy that degrades as the window fills, and high cost for iterative development.
Manual Chunking: High cognitive overhead for the user to manage which files the agent sees.
Extra context
This feature should primarily affect:
Interactive TUI behavior: A dashboard to visualize the "Agent Topology" and which experts are currently "swapped into memory."
Future server/headless API surface: The KV cache management and Expert state persistence (the "SSD" part of the proposal) would benefit from a background daemon or server-side state management to ensure low-latency agent switching.
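The swap-in/swap-out behaviour described above could look roughly like the following sketch. It is only an illustration under stated assumptions: ExpertStateStore and its methods are hypothetical names, the eviction policy is plain LRU (the proposal does not name one), and pickle on an arbitrary Python object stands in for real KV cache tensors, which would be managed by the inference server rather than by client code.

```python
from __future__ import annotations

import pickle
from collections import OrderedDict
from pathlib import Path


class ExpertStateStore:
    """Keeps at most `capacity` expert states in memory; spills the rest to disk."""

    def __init__(self, spill_dir: Path, capacity: int = 2) -> None:
        self.spill_dir = spill_dir
        self.capacity = capacity
        # Ordered from least to most recently used.
        self.resident: OrderedDict[str, object] = OrderedDict()

    def _path(self, name: str) -> Path:
        return self.spill_dir / f"{name}.state"

    def put(self, name: str, state: object) -> None:
        """Store or refresh an expert's state and mark it most recently used."""
        self.resident[name] = state
        self.resident.move_to_end(name)
        self._evict()

    def get(self, name: str) -> object:
        """Return an expert's state, swapping it in from disk if needed.

        Raises FileNotFoundError if the expert was never stored.
        """
        if name not in self.resident:
            with self._path(name).open("rb") as fh:
                self.resident[name] = pickle.load(fh)
        self.resident.move_to_end(name)
        self._evict()
        return self.resident[name]

    def _evict(self) -> None:
        """Offload least recently used states until we fit in `capacity`."""
        while len(self.resident) > self.capacity:
            victim, state = self.resident.popitem(last=False)
            with self._path(victim).open("wb") as fh:
                pickle.dump(state, fh)

    def resident_names(self) -> list[str]:
        """Experts currently held in memory, oldest first."""
        return list(self.resident)
```

resident_names() is the kind of information the "Agent Topology" dashboard would need to show which experts are currently swapped into memory. A background daemon could own one such store so that state survives across TUI sessions.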