Put a real-time xAI Grok voice agent on a phone call. Write the agent in
Python — instructions, tools, handlers — call agent.call(to="+1..."), and the
library places an outbound call through Twilio and bridges the caller's
audio to Grok in both directions.
The ergonomics are OpenAI-SDK-shaped (model, instructions, tools,
tool_handlers), but the runtime is telephony: Twilio Media Streams on one
side, xAI's realtime WebSocket on the other, and your tool code in the middle.
See spec.md for the full design doc.
A phone call can't speak xAI's wire protocol, so a small FastAPI server sits between them and translates:
VoiceAgent ──── agent.call(to=…), outbound REST ────► Twilio ──── PSTN ────► Callee
(instructions,                                          ▲
 tools, handlers)                                       │ Media Streams WS, μ-law 8 kHz
     ▲                                                  ▼
     └──── tool calls / transcript / events ──── FastAPI bridge
                                            (transcode + tool relay)
                                                        ▲
                                                        │ WS, PCM16 24 kHz
                                                        ▼
                                                xAI Grok realtime
- agent.call(transport="twilio", call_details={"to": "+1…"}) uses the Twilio REST API to dial the "to" number from your TWILIO_FROM_NUMBER, telling Twilio to fetch PUBLIC_BASE_URL/twilio/voice/{session_id} for instructions when the callee picks up.
- That endpoint returns <Connect><Stream> TwiML pointing Twilio at the media WebSocket on the same server.
- The bridge accepts the Media Streams socket, opens a server-side xAI Grok realtime connection (authenticated directly with XAI_API_KEY; no ephemeral key is needed because there is no browser here), and runs the two streams concurrently: it transcodes caller audio from μ-law 8 kHz to PCM16 24 kHz for Grok, and Grok's audio back the other way, with barge-in handling.
- Tool calls, transcripts, and errors from Grok are relayed to your VoiceAgent (handlers run in Python; results are sent back to the model).
- On hangup the session closes and a summary (duration, transcript, tool calls) is available via agent.get_session(id).summary().
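The TwiML returned in the second step can be sketched with the standard library. Connect and Stream are real TwiML verbs, but the /twilio/media/{session_id} WebSocket path and the build_stream_twiml helper are assumptions made for illustration; the library's actual route may differ.

```python
from xml.etree import ElementTree as ET


def build_stream_twiml(public_base_url: str, session_id: str) -> str:
    """Return TwiML that connects the call to a media WebSocket.

    The /twilio/media/{session_id} path is hypothetical.
    """
    # Media Streams needs a wss:// URL, so swap the scheme of the base URL.
    ws_base = public_base_url.replace("https://", "wss://", 1).rstrip("/")
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", url=f"{ws_base}/twilio/media/{session_id}")
    return ET.tostring(response, encoding="unicode")


twiml = build_stream_twiml("https://voice.example.com", "sess_123")
print(twiml)
```

Twilio fetches this document when the callee answers and then opens the WebSocket named in the Stream element.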
Telephony is the one place audio flows through Python — it has to, to transcode between Twilio's μ-law 8 kHz and Grok's PCM16 24 kHz.
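To make the transcoding step concrete, here is a minimal pure-Python sketch of the caller-to-Grok direction: G.711 μ-law decoding followed by 3x upsampling. The function names are hypothetical, and the linear interpolation is a simplification chosen for brevity; the bridge's real resampler is not shown in this document and a production one would normally apply a proper low-pass filter.

```python
import struct


def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def upsample_3x(samples: list[int]) -> list[int]:
    """Go from 8 kHz to 24 kHz by linear interpolation (no filtering)."""
    out = []
    for i, cur in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else cur
        step = nxt - cur
        out.extend((cur, cur + step // 3, cur + 2 * step // 3))
    return out


def twilio_to_grok(ulaw_payload: bytes) -> bytes:
    """Convert a mu-law 8 kHz payload to little-endian PCM16 at 24 kHz."""
    pcm_24k = upsample_3x([ulaw_to_pcm16(b) for b in ulaw_payload])
    return struct.pack(f"<{len(pcm_24k)}h", *pcm_24k)


frame = bytes([0xFF] * 160)  # 160 samples of mu-law silence = 20 ms at 8 kHz
pcm = twilio_to_grok(frame)
print(len(frame), "->", len(pcm), "bytes")
```

Each μ-law byte becomes three 16-bit samples, so the payload grows sixfold on the way to Grok; the return path does the inverse.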
Telephony needs the fastapi extra (FastAPI, uvicorn, websockets):

pip install "git+https://github.com/ashbhat/voiceagentpy.git#egg=voiceagentpy[fastapi]"

For local development:

git clone https://github.com/ashbhat/voiceagentpy
cd voiceagentpy
pip install -e ".[fastapi,dev]"

Twilio has to reach your app back, so a localhost demo needs a public URL. The
Quickstart uses a free cloudflared
quick tunnel, which needs no Cloudflare account and no config. Install it once:

brew install cloudflared # macOS; see cloudflared docs for other OSes

(Only needed for the tunnel path. Skip it if you're deploying behind your own
https:// URL; see Configuration.)
example/ is one self-contained script: it builds the agent,
launches a cloudflared tunnel so Twilio can reach back (no public URL to
provision), places the call, waits for it to finish, prints the summary, and
exits. With cloudflared installed (above):

cd example
python3 -m venv ../.venv && source ../.venv/bin/activate
pip install -r requirements.txt # voiceagentpy[fastapi] + python-dotenv
cp ../.env.example ../.env # fill in XAI_API_KEY + the TWILIO_* vars
python3 main.py +14085987929 # or set CALL_TO in .env

Your phone rings; answer and talk to Grok. You do not set
PUBLIC_BASE_URL; the script launches the tunnel and wires it in. Set it only
to use your own https:// URL instead (see Configuration).
See example/README.md for the full walkthrough,
including the inbound webhook (POST /twilio/voice) and a carrier note about
outbound call blocking.
main.py is just the snippet below: a VoiceAgent, the FastAPI app it
serves, and agent.call(...). This is the code you'd write in your own app,
behind your own PUBLIC_BASE_URL instead of the demo tunnel:

import threading

import uvicorn
from voiceagentpy import VoiceAgent
from voiceagentpy.fastapi_ext import build_fastapi_app

agent = VoiceAgent(
    model="grok-voice-latest",
    instructions=(
        "You are a friendly phone support agent. Keep replies short and "
        "conversational: this is a real phone call. Call a tool the moment "
        "you have enough info, before speaking."
    ),
    voice="friendly-support",
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_user",
            "description": "Look up a user account by phone number.",
            "parameters": {
                "type": "object",
                "properties": {"phone": {"type": "string"}},
            },
        },
    }],
    tool_handlers={
        "lookup_user": lambda phone=None, **_: {"name": "Avery", "plan": "pro"},
    },
    turn_detection={"type": "server_vad"},
)

# Twilio must be able to reach this app at PUBLIC_BASE_URL for the TwiML
# callback + media WebSocket, so run it before placing the call.
app = build_fastapi_app(agent)
threading.Thread(
    target=lambda: uvicorn.run(app, host="0.0.0.0", port=8000), daemon=True
).start()

# Place the outbound call. The callee's phone rings and they talk to Grok.
res = agent.call(transport="twilio", call_details={"to": "+1..."})
print("dialing", res.call_sid, "session", res.id)

For your own deployment, set PUBLIC_BASE_URL yourself (a hosted deployment, ngrok,
etc.); it must be an https:// URL Twilio can reach back on.
Tools are how the agent does things mid-call: look up an account, check availability, file a ticket. You declare them with the standard OpenAI function-calling JSON schema and register a Python handler per tool:

agent = VoiceAgent(
    model="grok-voice-latest",
    tools=[{
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book an appointment for the caller.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string", "description": "ISO 8601 date"},
                    "service": {"type": "string"},
                },
                "required": ["date", "service"],
            },
        },
    }],
    tool_handlers={"book_appointment": book_appointment},
)

When Grok decides to call a tool mid-conversation, the bridge relays it to the
VoiceAgent, which:
- Parses the JSON arguments and invokes your handler as handler(**arguments); sync and async handlers both work.
- JSON-encodes the return value and sends it back to the model as the tool result, then nudges the model to continue speaking with the answer.
- Records the call on the session (session.tool_calls) and emits tool.called / tool.completed events.
If a handler raises, the exception is caught and returned to the model as
{"error": "<message>"} rather than dropping the call — the agent can recover
gracefully ("sorry, I couldn't reach that system").
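The dispatch behaviour described above (parse, invoke, await if needed, turn exceptions into an error payload) can be sketched as follows. dispatch_tool_call and flaky_lookup are hypothetical names for illustration, not the library's internals.

```python
import asyncio
import inspect
import json


async def dispatch_tool_call(handlers: dict, name: str, raw_arguments: str) -> str:
    """Run one tool call and return the JSON string to send back to the model."""
    try:
        arguments = json.loads(raw_arguments or "{}")
        result = handlers[name](**arguments)
        if inspect.isawaitable(result):  # async handlers return a coroutine
            result = await result
    except Exception as exc:
        # Report the failure to the model instead of dropping the call.
        result = {"error": str(exc)}
    return json.dumps(result)


def book_appointment(date: str, service: str, **_):
    return {"confirmed": True, "date": date, "service": service}


async def flaky_lookup(**_):
    raise RuntimeError("CRM timeout")


handlers = {"book_appointment": book_appointment, "lookup_user": flaky_lookup}
ok = asyncio.run(dispatch_tool_call(
    handlers, "book_appointment", '{"date": "2025-01-15", "service": "haircut"}'
))
failed = asyncio.run(dispatch_tool_call(handlers, "lookup_user", "{}"))
print(ok)
print(failed)
```

The second call yields an error object rather than an exception, which is what lets the model apologise and carry on.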
There are three equivalent ways to register a handler:

# 1. constructor map
VoiceAgent(..., tool_handlers={"book_appointment": book_appointment})

# 2. decorator (handler for an already-declared tool)
@agent.tool("book_appointment")
def book_appointment(date: str, service: str, **_):
    return {"confirmed": True, "date": date, "service": service}

# 3. async handler, awaited automatically
@agent.tool("lookup_user")
async def lookup_user(phone: str = None, **_):
    return await db.fetch_user(phone)  # db stands for your own data layer

Handlers should accept **_ (or **kwargs) so an extra/unexpected argument
from the model never raises a TypeError.
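A small plain-Python illustration of why the catch-all matters; the handler names and the extra "country" argument are made up for the example.

```python
def strict_handler(phone=None):
    return {"phone": phone}


def tolerant_handler(phone=None, **_):
    return {"phone": phone}


# Suppose the model sends an argument that is not in the declared schema.
model_arguments = {"phone": "+15550100", "country": "US"}

try:
    strict_handler(**model_arguments)
    strict_failed = False
except TypeError:
    strict_failed = True  # unexpected keyword argument 'country'

tolerant_result = tolerant_handler(**model_arguments)
print(strict_failed, tolerant_result)
```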
Pass default_tool_handler and any tool call without a registered handler is
routed there instead of erroring. The stock mock_tool_response returns a
generic stub, so you can declare the full toolset and exercise the conversation
before writing a single handler:

from voiceagentpy import VoiceAgent, mock_tool_response

agent = VoiceAgent(
    model="grok-voice-latest",
    tools=[...],  # definitions only
    tool_handlers={},  # nothing wired yet
    default_tool_handler=mock_tool_response,  # auto-mocks everything
)

default_tool_handler is called as handler(tool_name, arguments) (sync or
async). Specific entries in tool_handlers always win over the default.
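The precedence rule can be sketched like this. run_tool and stub_response are hypothetical stand-ins (stub_response is not the library's mock_tool_response, whose output format is not documented here); only the two calling conventions come from the text above.

```python
import asyncio
import inspect


def stub_response(tool_name, arguments):
    """Hypothetical default handler, called as handler(tool_name, arguments)."""
    return {"mock": True, "tool": tool_name, "arguments": arguments}


async def run_tool(tool_handlers, default_tool_handler, tool_name, arguments):
    handler = tool_handlers.get(tool_name)
    if handler is not None:
        result = handler(**arguments)  # a specific handler always wins
    elif default_tool_handler is not None:
        result = default_tool_handler(tool_name, arguments)
    else:
        raise LookupError(f"no handler registered for {tool_name!r}")
    return await result if inspect.isawaitable(result) else result


tool_handlers = {"lookup_user": lambda phone=None, **_: {"name": "Avery"}}
specific = asyncio.run(
    run_tool(tool_handlers, stub_response, "lookup_user", {"phone": "+15550100"})
)
fallback = asyncio.run(
    run_tool(tool_handlers, stub_response, "file_ticket", {"subject": "billing"})
)
print(specific)
print(fallback)
```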
Pass event_handler and/or finish_handler to watch a call live and get a
summary at the end:

agent = VoiceAgent(
    ...,
    event_handler=lambda e: print(e["type"], e.get("data")),
    finish_handler=lambda s: print("done:", s["session_id"]),
)

event_handler receives dicts for session.started / session.ended,
transcript.delta / transcript.final, tool.called / tool.completed,
audio.started / audio.ended, and error. After a call ends:

sess = agent.get_session(res.id)
s = sess.summary()  # {"duration_ms", "transcript", "tool_calls", ...}

| Env var | Required | Purpose |
|---|---|---|
| XAI_API_KEY | yes | Auth for the Grok realtime connection |
| TWILIO_ACCOUNT_SID | yes | Twilio account (starts AC…) |
| TWILIO_AUTH_TOKEN | yes | Twilio auth token; also validates inbound webhooks |
| TWILIO_FROM_NUMBER | yes | Caller ID for outbound (a local long-code dials more reliably than a toll-free 8xx) |
| PUBLIC_BASE_URL | yes* | https:// URL Twilio uses for the TwiML callback + media WS |
| CALL_TO | no | Default number to dial (the example/ script) |
| PORT | no | Port the FastAPI app binds (default 8000) |

* The example/ script launches a cloudflared tunnel and sets
PUBLIC_BASE_URL for you, so you only set it for your own deployment/tunnel.
Per-call overrides go in call_details: from, account_sid, auth_token,
public_base_url, status_callback.
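For your own deployment it can help to fail fast when a required variable is missing. The sketch below checks the variables from the table at startup; load_telephony_config is a hypothetical helper, not part of the library, and the sample values are placeholders.

```python
import os

REQUIRED = (
    "XAI_API_KEY",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "TWILIO_FROM_NUMBER",
    "PUBLIC_BASE_URL",
)


def load_telephony_config(env=None) -> dict:
    """Collect the settings from the table, raising early if any are unusable."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing env vars: " + ", ".join(missing))
    if not env["PUBLIC_BASE_URL"].startswith("https://"):
        raise RuntimeError("PUBLIC_BASE_URL must be an https:// URL")
    config = {name: env[name] for name in REQUIRED}
    config["PORT"] = int(env.get("PORT", "8000"))  # documented default
    config["CALL_TO"] = env.get("CALL_TO")  # optional
    return config


sample_env = {
    "XAI_API_KEY": "xai-placeholder",
    "TWILIO_ACCOUNT_SID": "AC-placeholder",
    "TWILIO_AUTH_TOKEN": "token-placeholder",
    "TWILIO_FROM_NUMBER": "+15550100",
    "PUBLIC_BASE_URL": "https://voice.example.com",
}
config = load_telephony_config(sample_env)
print(config["PORT"], config["PUBLIC_BASE_URL"])
```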
Available models are grok-voice-latest (used above), grok-voice-think-fast-1.0, and
grok-voice-fast-1.0. The voice parameter accepts xAI voice names directly or these aliases:
friendly-support, warm, calm-narrator, energetic, neutral.
agent.call(...) is outbound. The same FastAPI app also serves inbound:
point a Twilio number's voice webhook at POST /twilio/voice (signature is
validated when TWILIO_AUTH_TOKEN is set).
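For context, Twilio signs form-encoded webhooks by appending the POST parameters, sorted by name, to the full request URL, computing an HMAC-SHA1 with the auth token, and sending the Base64 result in the X-Twilio-Signature header. The stdlib sketch below reproduces that scheme; how the library itself performs the check is not shown here, and in production Twilio's own RequestValidator helper is the safer choice.

```python
import base64
import hashlib
import hmac


def compute_twilio_signature(auth_token: str, url: str, params: dict) -> str:
    """Signature Twilio would send for a form-encoded POST to `url`."""
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode(), payload.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def is_valid_twilio_request(auth_token, url, params, header_signature) -> bool:
    expected = compute_twilio_signature(auth_token, url, params)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, header_signature)


# Placeholder values for illustration only.
token = "token-placeholder"
url = "https://voice.example.com/twilio/voice"
params = {"CallSid": "CA-placeholder", "From": "+15550100", "To": "+15550199"}

signature = compute_twilio_signature(token, url, params)
print(is_valid_twilio_request(token, url, params, signature))
print(is_valid_twilio_request(token, url, {**params, "From": "+15550123"}, signature))
```

The URL must be exactly the one Twilio requested, so behind a tunnel or proxy it should be rebuilt from PUBLIC_BASE_URL rather than from the local host header.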
The bridge talks to your agent through a ControlPlane seam. The default
InProcessControlPlane is the monolith above; injecting HttpControlPlane
splits the media bridge into a standalone telephony microservice that calls
your agent over HTTP — the same code, a wiring change. See
voiceagentpy.fastapi_ext.build_fastapi_app and
voiceagentpy.telephony.control_plane.
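The idea of a seam with two interchangeable implementations can be illustrated with a toy version. Everything below is hypothetical: the class names, the single handle_tool_call method, and the /tool-call path are invented for the sketch and are not the signatures of ControlPlane, InProcessControlPlane, or HttpControlPlane.

```python
import json
from typing import Callable, Protocol


class ControlPlaneSketch(Protocol):
    def handle_tool_call(self, session_id: str, name: str, arguments: dict) -> dict: ...


class InProcessSketch:
    """Monolith wiring: the bridge calls the agent's handlers directly."""

    def __init__(self, handlers: dict):
        self.handlers = handlers

    def handle_tool_call(self, session_id, name, arguments):
        return self.handlers[name](**arguments)


class HttpSketch:
    """Microservice wiring: the bridge posts the tool call to the agent."""

    def __init__(self, base_url: str, post: Callable[[str, bytes], bytes]):
        self.base_url = base_url.rstrip("/")
        self.post = post  # injected HTTP client, so the sketch needs no network

    def handle_tool_call(self, session_id, name, arguments):
        body = json.dumps(
            {"session_id": session_id, "name": name, "arguments": arguments}
        ).encode()
        return json.loads(self.post(f"{self.base_url}/tool-call", body))


def bridge_relay(plane: ControlPlaneSketch, session_id, name, arguments):
    # The bridge code is identical whichever implementation is injected.
    return plane.handle_tool_call(session_id, name, arguments)


local = InProcessSketch({"lookup_user": lambda phone=None, **_: {"name": "Avery"}})


def fake_post(url: str, body: bytes) -> bytes:
    # Stands in for a real HTTP round trip by calling the local agent.
    request = json.loads(body)
    result = local.handle_tool_call(
        request["session_id"], request["name"], request["arguments"]
    )
    return json.dumps(result).encode()


remote = HttpSketch("https://agent.example.com", fake_post)
in_process_result = bridge_relay(local, "sess_1", "lookup_user", {"phone": "+15550100"})
http_result = bridge_relay(remote, "sess_1", "lookup_user", {"phone": "+15550100"})
print(in_process_result, http_result)
```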
- shipped: Twilio telephony (outbound agent.call, inbound webhook, FastAPI media bridge, in-process / HTTP control-plane split)
- next: SHAKEN/STIR + branded caller-ID guidance for production outbound
- next: normalized voice catalog and provider-direct tool webhooks