Research: Cloudflare Agents voice API as native voice layer for aegis-oss agents #39

@stackbilt-admin

Overview

Cloudflare's @cloudflare/voice package (part of the Agents platform, currently in Beta) provides a complete voice pipeline built on Durable Objects: STT, LLM turn handling, TTS, and conversation persistence. This issue evaluates how it fits as an optional voice capability for aegis-oss agents.

Docs: https://developers.cloudflare.com/agents/api-reference/voice/


What the API provides

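// Server-side: extend a DO with voice
// Import names are taken from this issue; verify them against the linked docs.
import { withVoice, WorkersAIFluxSTT, WorkersAITTS, type TurnContext } from "@cloudflare/voice";

// AgentBase stands in for the aegis-oss Durable Object base class.
export class MyAgent extends withVoice(AgentBase) {
  // A bare `env` is not in scope in a class field; read the binding from the instance.
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string, ctx: TurnContext) {
    // plug into aegis-oss memory + dispatch here
    return this.kernel.dispatch(transcript);
  }
}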

// Client-side: React hook
const { status, transcript, startCall, endCall } = useVoiceAgent(agentUrl);

Key capabilities:

  • Binary PCM audio over WebSocket, sentence-chunked streaming TTS
  • Interruption handling (user speech aborts in-flight LLM/TTS)
  • SQLite-backed conversation history via DO storage
  • Hooks: afterTranscribe, beforeSynthesize, onInterrupt, plus lifecycle hooks
  • Providers: Workers AI STT/TTS (free tier), or third-party Deepgram, ElevenLabs, and Twilio
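
To make "sentence-chunked streaming TTS" and interruption handling concrete, here is a minimal sketch of the buffering idea. It is not Cloudflare's implementation: SentenceChunker and its methods are hypothetical names, and the boundary rule (terminal punctuation followed by whitespace) is an assumption chosen so that decimals such as 3.14 are not split.

```typescript
// Hypothetical helper, not part of @cloudflare/voice: buffers streamed LLM
// text and releases it one complete sentence at a time.
class SentenceChunker {
  private buffer = "";

  // Append a token and return any sentences that are now complete.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Assumed boundary: ., ! or ? followed by at least one whitespace character.
    const boundary = /[.!?]+\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(end);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Return the remaining text once the LLM stream has ended.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }

  // Drop buffered text, e.g. when user speech interrupts the turn.
  reset(): void {
    this.buffer = "";
  }
}
```

Each sentence returned by push() could be handed to TTS immediately, so audio starts before the LLM finishes. On interruption, reset() discards text that should never be spoken.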

Design questions for aegis-oss

  • Composition model: Should withVoice wrap the agent's DO class, or should aegis-oss expose a VoiceCapability mixin that delegates to CF's mixin internally?
  • Memory integration: The afterTranscribe hook is the natural injection point for aegis-oss memory context. How do we surface this cleanly?
  • onTurn → kernel dispatch: onTurn returns a string/stream. Does this map 1:1 to the aegis-oss executor interface, or do we need an adapter?
  • Packaging: Core bundle vs. optional @aegis-oss/voice addon package?
  • Provider abstraction: Should aegis-oss abstract over CF's STT/TTS providers, or pass through CF's interface directly?
  • Framework-agnostic client: VoiceClient (non-React) may matter for agents used outside React or browser contexts. Should aegis-oss support it alongside the hook?
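
One possible answer to the memory-integration and adapter questions, sketched under stated assumptions. Executor, ExecutorInput, MemoryStore, buildExecutorInput and makeTurnHandler are all hypothetical names; nothing here is imported from aegis-oss or @cloudflare/voice, and the sketch covers only the string return case, not streaming.

```typescript
// Hypothetical shapes: none of these come from aegis-oss or @cloudflare/voice.
interface ExecutorInput {
  prompt: string;
  context: string[];
}

interface Executor {
  execute(input: ExecutorInput): Promise<{ text: string }>;
}

interface MemoryStore {
  // Returns up to `limit` stored facts relevant to the query.
  recall(query: string, limit: number): string[];
}

// Pure step: combine the trimmed transcript with recalled memory.
function buildExecutorInput(
  transcript: string,
  memory: MemoryStore,
  limit = 3,
): ExecutorInput {
  return { prompt: transcript.trim(), context: memory.recall(transcript, limit) };
}

// Adapter: exposes the executor as a (transcript) => Promise<string> function,
// which an onTurn override could call and return from.
function makeTurnHandler(executor: Executor, memory: MemoryStore) {
  return async (transcript: string): Promise<string> => {
    const result = await executor.execute(buildExecutorInput(transcript, memory));
    return result.text;
  };
}
```

If the real executor interface already accepts a plain string, the adapter collapses to a one-liner and the 1:1 mapping holds. Keeping buildExecutorInput pure makes the memory lookup testable without a Durable Object.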

Opportunity

The CF voice API aligns closely with aegis-oss's DO-native architecture. Shipping voice as a first-class optional capability would differentiate aegis-oss from other agent frameworks that require external voice infra.

