Problem Statement
OpenAI Gym abstraction is built for serial MDP execution: one action yields one observation, so the execution order is action -> observation -> action -> observation -> ...
Parallel execution (subagents, multi-agents) is important for scaling test-time compute and keeping wall-clock time down.
As a simple example, if you take a step() and the action is to spawn two subagents, then naturally two observations will be produced: one for each subagent prompt. But the interface as-is returns only one observation, and you can't get two observations without submitting two actions.
Proposed Solution
Change from a single step() function to a producer-consumer architecture with decoupled act() and observe(). The user can then maintain both an observe asyncio task and multiple act tasks at all times, and thus any execution order can happen such as action -> observation -> observation -> action -> action -> ...
For RL purposes, the MDP abstraction still fits mathematically if you take your trajectory to be ordered either by the observation time or the action time. Prefix merging for training also still works if you use a trie or rolling hash or similar, in order to match prefixes which are not directly the previous request.
Alignment with Principles
Addresses "Design for LLMs" by acknowledging that LLM inference and tool execution has long and variable-length latency which makes serial MDP execution less viable for larger workflows
Bends "Simple Gymnasium-style API"; I claim that departing from the step() API is necessary to support concurrency.
Principles Addressed
Trade-offs
I'm not aware of any alternatives.
Impact
Breaks the current API contract.
Files/Systems Affected
All
Related RFCs
The only search result for "concurrency" is #194 , which I don't believe is related.
Implementation Sketch
observations: asyncio.Queue[tuple[UUID, Observation]]
actions: asyncio.Queue[tuple[UUID, Action]]
async def observe() -> tuple[UUID, Observation]:
return await self.observations.get()
def act(observation_uuid: UUID, action: Action) -> None:
self.actions.put_nowait((observation_uuid, action))
async def event_loop() -> None: ... # await self.actions.get(), self.observations.put_nowait()
Open Questions
N/A
Problem Statement
OpenAI Gym abstraction is built for serial MDP execution: one action yields one observation, so the execution order is action -> observation -> action -> observation -> ...
Parallel execution (subagents, multi-agents) is important for scaling test-time compute and keeping wall-clock time down.
As a simple example, if you take a
step()and the action is to spawn two subagents, then naturally two observations will be produced: one for each subagent prompt. But the interface as-is returns only one observation, and you can't get two observations without submitting two actions.Proposed Solution
Change from a single
step()function to a producer-consumer architecture with decoupledact()andobserve(). The user can then maintain both an observe asyncio task and multiple act tasks at all times, and thus any execution order can happen such as action -> observation -> observation -> action -> action -> ...For RL purposes, the MDP abstraction still fits mathematically if you take your trajectory to be ordered either by the observation time or the action time. Prefix merging for training also still works if you use a trie or rolling hash or similar, in order to match prefixes which are not directly the previous request.
Alignment with Principles
Addresses "Design for LLMs" by acknowledging that LLM inference and tool execution has long and variable-length latency which makes serial MDP execution less viable for larger workflows
Bends "Simple Gymnasium-style API"; I claim that departing from the step() API is necessary to support concurrency.
Principles Addressed
Trade-offs
I'm not aware of any alternatives.
Impact
Breaks the current API contract.
Files/Systems Affected
All
Related RFCs
The only search result for "concurrency" is #194 , which I don't believe is related.
Implementation Sketch
Open Questions
N/A