
Build an AI Voice Agent with Twilio and a Realtime Model
Wire a phone number to an AI: a Node tutorial connecting Twilio Media Streams to a realtime speech model so callers can talk to your agent.
Key takeaways
- The whole agent is a TwiML webhook returning a Connect Stream and a WebSocket that relays base64 audio between Twilio and a realtime model.
- Twilio sends G.711 mu-law at 8kHz mono, so if your model speaks mu-law natively you pass the bytes through and write no audio transcoding code.
- Callers feel something is wrong at about 500 milliseconds of silence, so stream audio back the instant the first delta arrives and pick a model with low time-to-first-audio.
- For barge-in, send Twilio a clear message when the model detects speech started, or the agent keeps playing its old sentence over the caller.
A voice agent is software that picks up a phone call and holds a conversation. Someone dials your number, your AI says hello, they ask a question out loud, and the AI answers out loud. No menu trees, no "press 1 for billing." Just talking.
There are three moving parts and they don't naturally fit together. First, telephony: the phone network, which Twilio gives you a handle on through a regular phone number. Second, audio streaming: getting the caller's voice off the phone network and into your code as it happens, then pushing the AI's voice back the other way. Third, the brain: a realtime speech-to-speech model that takes audio in and gives audio out, no transcription step that you have to babysit.
What you'll build in this post is the glue between those three. A small Node server with two jobs. One HTTP endpoint hands Twilio a bit of XML telling it to open a media stream. One WebSocket endpoint sits in the middle, shuffling audio frames between Twilio and the model. That's genuinely most of it. The interesting work is in the details of the audio format and the timing, which is where I'll spend the back half of the post.
I'm assuming you've built a Node service before and aren't scared of a WebSocket. I'm not assuming you know anything about Twilio's streaming protocol, because almost nobody does until they need it.
What you'll need
A Twilio account with a phone number that can receive voice calls. The trial tier is fine for this. Buy a number, note it down.
An API key for a realtime speech model. I'm keeping the model provider out of this on purpose, because the integration shape is the same across the ones I've used: you open a WebSocket to the provider, send them audio chunks, and they send you back audio chunks plus some control events. Wherever you see a placeholder URL or a vendor-specific event name below, swap in your provider's actual values from their docs. The mechanics don't change.
A public URL. Twilio is out on the internet and needs to reach your laptop, so during development you'll tunnel with ngrok or similar:
ngrok http 5050
That prints a forwarding host like https://a1b2c3d4.ngrok-free.app. The WebSocket version of that same host is wss://a1b2c3d4.ngrok-free.app. Keep both handy, you'll wire them into Twilio in a minute.
Node 18 or newer, and two dependencies (Express for the webhook, ws for the sockets):
npm install express ws
That's the whole shopping list.
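Since the provider URL and API key are the two values you'll swap most often, it can help to load them in one place and fail at startup rather than mid-call. This is only a sketch: MODEL_WS_URL is a variable name I'm inventing for illustration (MODEL_API_KEY is the one the relay uses later), and the function itself is hypothetical.

```javascript
// Hypothetical config loader: reads settings from an env-like object and
// throws early if something required is missing.
function loadConfig(env) {
  const missing = ["MODEL_API_KEY", "MODEL_WS_URL"].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  if (!env.MODEL_WS_URL.startsWith("wss://")) {
    throw new Error("MODEL_WS_URL should be a wss:// URL");
  }
  return {
    port: Number(env.PORT ?? 5050), // 5050 matches the ngrok command above
    modelUrl: env.MODEL_WS_URL,     // assumed name, taken from your provider's docs
    modelApiKey: env.MODEL_API_KEY,
  };
}
```

In the server you'd call it once at the top, something like const config = loadConfig(process.env), and use config.modelUrl wherever a placeholder URL appears.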
How a call actually flows
Before the code, here's the sequence, because it helped me to see it written down once.
1. Someone calls your Twilio number.
2. Twilio makes an HTTP request to your webhook asking "what do I do with this call?"
3. You answer with TwiML, Twilio's little XML dialect, telling it to open a media stream to your WebSocket.
4. Twilio connects to your WebSocket and starts firing JSON messages at you, including the caller's audio in tiny chunks.
5. You relay that audio to the model's WebSocket.
6. The model streams audio back to you.
7. You relay that audio to Twilio, which plays it down the phone line.
Steps 4 through 7 repeat for the whole call, many times a second. Now the parts.
The TwiML webhook
When a call comes in, Twilio hits your webhook and waits for instructions. The instruction we want is <Connect><Stream>, which opens a bidirectional media stream to a WebSocket of our choosing. "Bidirectional" matters: the older <Start><Stream> only forks the audio to you for listening. <Connect><Stream> lets you send audio back, which is the entire point here.
import express from "express";
const app = express();
app.post("/incoming-call", (req, res) => {
  const host = req.headers.host;
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://${host}/media-stream" />
  </Connect>
</Response>`;
  res.type("text/xml").send(twiml);
});
app.listen(5050, () => console.log("HTTP server on :5050"));
Set this endpoint as the "A call comes in" webhook on your Twilio number, using your ngrok HTTPS URL plus /incoming-call. I'm building the wss:// URL from the incoming request's Host header so the same code works behind ngrok today and behind a real domain later, without me editing a hardcoded string and forgetting.
One thing worth knowing: this XML response is a one-shot handoff. Once Twilio reads the <Connect><Stream> it opens the socket and stops talking to this HTTP endpoint. Everything after this happens over the WebSocket.
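If you'd like to test the TwiML without placing a call, one option is to pull the string-building into a pure function. The sketch below does that and also escapes the host, since the Host header comes from the request and ends up inside an XML attribute. Both function names are mine, not anything Twilio provides.

```javascript
// Escape the characters that are special inside an XML attribute value.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the <Connect><Stream> response for a given public host.
function buildStreamTwiml(host) {
  const url = `wss://${escapeXmlAttr(host)}/media-stream`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "<Connect>",
    `<Stream url="${url}" />`,
    "</Connect>",
    "</Response>",
  ].join("\n");
}
```

The webhook handler then shrinks to res.type("text/xml").send(buildStreamTwiml(req.headers.host)).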
The WebSocket server
This is the part that earns its keep. I'll attach a ws server to the same HTTP server so Twilio's wss://.../media-stream lands here.
import { WebSocketServer } from "ws";
import { createServer } from "http";
import express from "express";
const app = express();
// ... the /incoming-call handler from above, minus its app.listen() call,
// because server.listen() below takes over port 5050 ...
const server = createServer(app);
const wss = new WebSocketServer({ server, path: "/media-stream" });
server.listen(5050, () => console.log("HTTP + WS on :5050"));
Now, what does Twilio send once it connects? A stream of JSON text messages. Each has an event field that tells you which kind it is. There are four you care about:
- connected, the socket is up. Arrives once, carries no audio. Mostly a heartbeat you can ignore.
- start, the stream is starting. This one is important because it carries the streamSid, an identifier you must echo back on every audio frame you send. It also tells you the media format.
- media, a chunk of the caller's audio. The bytes live in media.payload as a base64-encoded string. These arrive continuously, roughly every 20 milliseconds.
- stop, the caller hung up or the stream ended. Time to clean up.
Here's the shape of a start and a media message, trimmed to what matters:
{
  "event": "start",
  "start": {
    "streamSid": "MZ18ad3ab5...",
    "callSid": "CA...",
    "mediaFormat": { "encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1 }
  }
}
{
  "event": "media",
  "media": {
    "track": "inbound",
    "chunk": "5",
    "timestamp": "1043",
    "payload": "/+5+/v7+/v..."
  }
}
Note mediaFormat: audio/x-mulaw at 8000 Hz, mono. Hold that thought, it comes back to bite us in the gotchas section.
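To make the format concrete: μ-law stores one byte per sample, so at 8000 samples a second a 20 millisecond frame works out to 160 bytes. Here's a small sketch that decodes a media message's payload and reports its size and duration. The function name is made up, and the constants assume the mediaFormat shown above.

```javascript
const SAMPLE_RATE_HZ = 8000; // from mediaFormat.sampleRate
const BYTES_PER_SAMPLE = 1;  // mu-law stores one byte per sample

// Summarise a Twilio "media" message: how many bytes, how many milliseconds.
function describeMediaFrame(msg) {
  const bytes = Buffer.from(msg.media.payload, "base64");
  return {
    track: msg.media.track,
    timestampMs: Number(msg.media.timestamp), // Twilio sends this as a string
    byteLength: bytes.length,
    durationMs: (bytes.length * 1000) / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ),
  };
}
```

Logging this for the first few frames of a call is a quick sanity check that you're receiving what you think you are before any audio reaches the model.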
Wiring the two sockets together
The handler opens a connection to the model the moment Twilio connects, then relays in both directions. I'll show the whole thing and then walk through it.
import WebSocket from "ws";
wss.on("connection", (twilioWs) => {
  let streamSid = null;
  // Open the model connection up front so it's warm by the time
  // the caller starts talking.
  const modelWs = new WebSocket("wss://your-model-provider.example/realtime", {
    headers: { Authorization: `Bearer ${process.env.MODEL_API_KEY}` },
  });
  modelWs.on("open", () => {
    // Tell the model what audio you're sending and what you want back.
    // These field names are illustrative; copy yours from the provider's docs.
    modelWs.send(JSON.stringify({
      type: "session.update",
      session: {
        input_audio_format: "g711_ulaw", // μ-law 8kHz, matches Twilio
        output_audio_format: "g711_ulaw",
        voice: "alloy",
        instructions: "You are a friendly phone receptionist. Keep replies short.",
      },
    }));
  });
  // Twilio -> model
  twilioWs.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    switch (msg.event) {
      case "start":
        streamSid = msg.start.streamSid;
        break;
      case "media":
        if (modelWs.readyState === WebSocket.OPEN) {
          modelWs.send(JSON.stringify({
            type: "input_audio_buffer.append",
            audio: msg.media.payload, // already base64 μ-law, pass it straight through
          }));
        }
        break;
      case "stop":
        modelWs.close();
        break;
    }
  });
  // model -> Twilio
  modelWs.on("message", (raw) => {
    const event = JSON.parse(raw.toString());
    // The model streams its speech back as audio deltas.
    if (event.type === "response.audio.delta" && streamSid) {
      twilioWs.send(JSON.stringify({
        event: "media",
        streamSid,
        media: { payload: event.delta }, // base64 μ-law
      }));
    }
  });
  twilioWs.on("close", () => modelWs.close());
  modelWs.on("close", () => twilioWs.close());
  // An "error" event with no listener throws and takes the whole process down.
  twilioWs.on("error", (err) => console.error("Twilio socket error:", err.message));
  modelWs.on("error", (err) => console.error("Model socket error:", err.message));
});
The relay itself is dumb on purpose, and that's the right instinct. When a media event comes in from Twilio, you grab its base64 payload and hand it to the model. When the model emits an audio delta, you wrap it in a media message and send it back. The format you send back to Twilio is the same { event: "media", streamSid, media: { payload } } envelope you saw it use, which is a nice symmetry once you notice it.
The thing I want to flag is that I asked the model for g711_ulaw audio in both directions in the session.update. When the provider can speak μ-law at 8kHz natively, the payloads flow through untouched and you write zero audio code. That's the happy path, and you should reach for it whenever your provider offers it. If it doesn't, you transcode, which is the first gotcha.
Opening modelWs immediately on connection, rather than waiting for the first media frame, buys you a little headroom. The model's socket handshake and session setup take a moment, and you'd rather pay that cost in the beat before the caller speaks than after they've said "hi."
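One side effect of the relay as written: any caller audio that arrives before the model socket is open gets skipped. If that clips the first word in your testing, a small bounded queue is one way to hold those frames until the socket is ready. This is a sketch under that assumption; createBufferedSender is a hypothetical helper, and the cap of 50 frames (about a second at 20 milliseconds each) is an arbitrary choice.

```javascript
const OPEN = 1; // WebSocket readyState value for an open socket

// Hypothetical helper: holds frames while the socket is connecting, up to a cap.
function createBufferedSender(socket, maxQueued = 50) {
  const queue = [];
  return {
    // Send now if the socket is open, otherwise queue the frame.
    send(text) {
      if (socket.readyState === OPEN) {
        socket.send(text);
      } else {
        queue.push(text);
        if (queue.length > maxQueued) queue.shift(); // drop the oldest frame
      }
    },
    // Drain queued frames in order; call this from the socket's "open" handler.
    flush() {
      while (queue.length > 0 && socket.readyState === OPEN) {
        socket.send(queue.shift());
      }
    },
    get queued() {
      return queue.length;
    },
  };
}
```

You'd wrap modelWs once, call flush() right after sending the session setup, and route the media case through send(). The queue only exists during connection setup, so it doesn't add latency to the steady-state relay.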
Gotchas
This is the section I wish someone had handed me before I started, so it's the longest one.
Audio format: μ-law, 8kHz, and why it's annoying
Twilio's phone audio is G.711 μ-law, 8kHz, mono. This is not a modern format. It's the codec of the public switched telephone network, which is old, and it sounds like a phone call because it is one. 8kHz means you get frequencies up to 4kHz, which is enough for speech and nothing else.
If your model speaks μ-law natively, you're done, pass the bytes through as I did above. If it wants something else, usually 16-bit PCM at 16kHz or 24kHz, you have to transcode in both directions: decode μ-law to PCM and resample up on the way in, resample down and encode to μ-law on the way out. μ-law decode and encode are simple byte-level table operations and you can write them by hand, but resampling between 8kHz and 24kHz is where you'll want a real library rather than naive sample dropping, otherwise the audio gets gritty. Get this wrong and the symptom is distinctive: the agent sounds like a chipmunk or a demon, which means your sample rate assumption is off by a factor somewhere.
Match the rates explicitly at every boundary. Most of my early bugs here were one side assuming 8kHz and the other assuming 16kHz.
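For the case where you do have to transcode, here's roughly what the hand-written μ-law half looks like. It's a sketch of the standard G.711 μ-law bias-and-segment arithmetic, computed directly rather than from lookup tables, and the function names are my own. It deliberately leaves out resampling, which is the part worth handing to a library.

```javascript
const BIAS = 0x84;  // 132, added before the segment (exponent) search
const CLIP = 32635; // largest magnitude that still fits in 15 bits after adding BIAS

// Decode one mu-law byte to a signed 16-bit PCM sample.
function muLawDecode(byte) {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const t = ((mantissa << 3) + BIAS) << exponent;
  return u & 0x80 ? BIAS - t : t - BIAS;
}

// Encode one signed 16-bit PCM sample to a mu-law byte.
function muLawEncode(sample) {
  const sign = sample < 0 ? 0x80 : 0x00;
  const magnitude = Math.min(Math.abs(sample), CLIP) + BIAS;
  let exponent = 7;
  // Find the highest set bit between bit 14 (exponent 7) and bit 7 (exponent 0).
  for (let mask = 0x4000; exponent > 0 && (magnitude & mask) === 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Whole-buffer helpers: mu-law bytes <-> Int16Array of PCM samples (same sample rate).
function muLawToPcm16(muLawBytes) {
  const pcm = new Int16Array(muLawBytes.length);
  for (let i = 0; i < muLawBytes.length; i++) pcm[i] = muLawDecode(muLawBytes[i]);
  return pcm;
}

function pcm16ToMuLaw(pcm) {
  const out = Buffer.alloc(pcm.length);
  for (let i = 0; i < pcm.length; i++) out[i] = muLawEncode(pcm[i]);
  return out;
}
```

On the inbound side you'd run Buffer.from(payload, "base64") through muLawToPcm16, resample to whatever rate the model expects, and send that. The outbound side is the mirror image, ending in pcm16ToMuLaw(...).toString("base64").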
Latency, the one that actually hurts
Here's the honest part. Latency is the hard problem in voice agents and no amount of clean code makes it disappear. A human on a phone call starts to feel something is wrong at about 500 milliseconds of silence after they stop talking, and they'll talk over the agent or repeat themselves past a second. Your budget gets eaten by: the audio traveling phone network to Twilio to you, you to the model, the model deciding the caller finished a sentence (endpointing), the model generating speech, and all of it coming back. Those add up faster than you'd hope.
What helped me, in order of impact: pick a model with low time-to-first-audio, not just good quality. Run your relay server in a region close to your model provider, since the hop between your server and the model is one you control. Don't add buffering you don't need; every queue you insert is latency you chose. And stream the audio back the instant the first delta arrives rather than waiting for a complete response. The relay above does the last one already, which is most of the win.
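You can't tune what you don't measure, so here's a sketch of a tracker that times the gap between the model deciding the caller stopped talking and the first audio delta of the reply. The "speech stopped" event name is an assumption on my part (I'm guessing by analogy with the event names used elsewhere in this post), so substitute whatever your provider emits.

```javascript
// Event names are assumptions; substitute your provider's.
const SPEECH_STOPPED = "input_audio_buffer.speech_stopped";
const AUDIO_DELTA = "response.audio.delta";

// Hypothetical tracker. The clock is injectable so it can be tested without waiting.
function createLatencyTracker(now = () => Date.now()) {
  let speechStoppedAt = null;
  const samples = [];
  return {
    // Feed every model event through this. Returns the latency in ms on the
    // first audio delta of a turn, and null for every other event.
    observe(event) {
      if (event.type === SPEECH_STOPPED) {
        speechStoppedAt = now();
      } else if (event.type === AUDIO_DELTA && speechStoppedAt !== null) {
        const latencyMs = now() - speechStoppedAt;
        samples.push(latencyMs);
        speechStoppedAt = null; // only the first delta of each turn counts
        return latencyMs;
      }
      return null;
    },
    samples: () => samples.slice(),
  };
}
```

Keep in mind this only covers the model's share of the budget, measured at your server. It leaves out the phone network legs and the endpointing delay itself, so what the caller feels will be longer than the number it reports.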
I won't pretend you'll get it to feel like a human. You'll get it to feel like a decent phone system, and that's a real bar to clear.
Barge-in: letting the caller interrupt
People interrupt. They'll start answering before the agent finishes its sentence, and a good agent shuts up and listens. This is called barge-in, and it doesn't happen for free.
Two things have to cooperate. The model needs to detect that the caller started speaking and stop generating; most realtime models emit some kind of "speech started" event for this. And you need to flush the audio you've already sent to Twilio, because Twilio buffers what you give it and will happily keep playing the agent's old sentence over the caller's new one. Twilio gives you a clear message for exactly this:
// When the model says the caller started talking, dump Twilio's playback buffer.
if (event.type === "input_audio_buffer.speech_started" && streamSid) {
  twilioWs.send(JSON.stringify({ event: "clear", streamSid }));
}
Skip the clear and barge-in feels broken in a way users can't articulate but definitely notice: they talk, the agent keeps droning, and the call turns into a mess. The clear is small and easy to forget, and forgetting it is one of the more common reasons a demo that worked solo falls apart with a real, impatient caller.
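If you want to know whether the agent is actually still audible when the caller speaks, Twilio's stream protocol also has a mark message: you send a named mark after some audio, and Twilio echoes it back once playback reaches that point. The sketch below uses marks to send clear only when something is still queued. The tracker is a hypothetical helper, and the mark behavior is my reading of Twilio's Media Streams docs, so check it against them before relying on it.

```javascript
// Tracks outstanding "mark" messages so we know whether agent audio is still
// queued for playback on Twilio's side. `send` is whatever writes to the Twilio socket.
function createPlaybackTracker(send) {
  const pending = new Set();
  let counter = 0;
  return {
    // Call right after relaying an audio delta to Twilio.
    markSent(streamSid) {
      const name = `chunk-${++counter}`;
      pending.add(name);
      send(JSON.stringify({ event: "mark", streamSid, mark: { name } }));
    },
    // Call when Twilio sends back a message with event "mark".
    markReceived(msg) {
      pending.delete(msg.mark.name);
    },
    // Call on the model's "speech started" event. Returns true if it sent a clear.
    interrupt(streamSid) {
      if (pending.size === 0) return false; // nothing playing, nothing to clear
      send(JSON.stringify({ event: "clear", streamSid }));
      pending.clear();
      return true;
    },
  };
}
```

The plain version above, which sends clear on every "speech started" event, is perfectly fine too. The tracker mostly earns its place once you also want to tell the model how much of its sentence the caller really heard.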
Ending the call cleanly
A call can end three ways: the caller hangs up (you get a stop event, then the socket closes), the model finishes and you decide to wrap up, or something errors out. Close both sockets on any of them. The close handlers above cross-wire the two so neither one leaks when its partner drops. If you want the agent to end the call itself, say after a goodbye, note that the stream protocol has no hang-up message. Use Twilio's REST API to update the call's status to completed. Closing the WebSocket from your side only ends the <Connect>; Twilio then carries on with whatever TwiML follows it, and the call ends if there is none.
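If you go the REST route for hanging up, the request is a form-encoded POST to the call's resource with Status=completed, authenticated with your account SID and auth token. Here's a sketch with the request-building split out so it can be checked without touching the network. The function names are mine, and the callSid is the one from the start message.

```javascript
// Build the HTTP request that asks Twilio to end a live call.
function buildHangupRequest({ accountSid, authToken, callSid }) {
  const url = `https://api.twilio.com/2010-04-01/Accounts/${accountSid}/Calls/${callSid}.json`;
  const credentials = Buffer.from(`${accountSid}:${authToken}`).toString("base64");
  return {
    url,
    options: {
      method: "POST",
      headers: {
        Authorization: `Basic ${credentials}`,
        "Content-Type": "application/x-www-form-urlencoded",
      },
      body: new URLSearchParams({ Status: "completed" }).toString(),
    },
  };
}

// Send it with Node's built-in fetch (available from Node 18).
async function hangUp(params) {
  const { url, options } = buildHangupRequest(params);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Twilio hangup failed: HTTP ${res.status}`);
}
```

To use it you'd save msg.start.callSid alongside the streamSid in the start case, then call hangUp(...) once the agent's goodbye has finished playing. Hanging up the instant the model finishes generating will likely cut off the end of the sentence.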
Cost
Two meters run at once during every call: Twilio bills for the inbound minutes and the media stream, and your model provider bills for realtime audio, which is priced well above text tokens. A handful of test calls costs pocket change. A bug that leaves sockets open after a caller leaves does not, because the model keeps a live audio session running and billing while nobody's there. Make sure stop and close actually tear everything down, then watch your provider dashboard after your first few real calls. I've left a meter running overnight once. Once was enough.
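One way to make "tear everything down" hard to get wrong is a single teardown function that closes both sockets exactly once, plus a hard cap on call length as a backstop against a stuck session. A sketch, with a hypothetical helper name and an arbitrary 15 minute default you should tune to your own calls:

```javascript
// Hypothetical helper: returns a function that closes both sockets exactly once.
// A timer also fires it after maxCallMs, so a stuck session can't bill forever.
function createTeardown(twilioWs, modelWs, maxCallMs = 15 * 60 * 1000) {
  let done = false;
  const teardown = (reason) => {
    if (done) return false; // already torn down, nothing to do
    done = true;
    clearTimeout(timer);
    console.log(`Tearing down call: ${reason}`);
    for (const socket of [twilioWs, modelWs]) {
      try {
        socket.close();
      } catch (err) {
        console.error("Error closing socket:", err.message);
      }
    }
    return true;
  };
  const timer = setTimeout(() => teardown("max call duration reached"), maxCallMs);
  return teardown;
}
```

Inside the connection handler you'd create it once and call it from every exit path: the stop event, both close events, and both error events. Because it's idempotent, it doesn't matter which one fires first or how many fire.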
Wrapping up
The architecture is smaller than it looks: a webhook that returns six lines of XML, and a WebSocket that copies base64 strings from one socket to another while echoing the streamSid. If your model speaks μ-law you barely touch the audio at all.
The work that remains after you've got it talking is the work that matters: shaving latency, handling interruptions so the thing feels alive, and tightening the prompt so the agent stays on task instead of wandering. Start with the relay in this post, call your own number, and listen. The first time it answers and talks back, you'll know exactly what to fix next, because your own ear will tell you.