🎙️ Build Your Own AI Voice Agent: The 'Sandwich' Revolution

3 min read

Master the art of building production-ready AI Voice Agents in this step-by-step LangChain tutorial.

Imagine calling a sandwich shop. You say, "I want a roast beef with swiss, hold the mayo." The voice on the other end replies instantly, confirms your order, and even jokes about your excellent cheese choice.

Now, imagine that "person" is code you wrote.

Voice AI is the new frontier. But building an agent that feels human—one that listens, thinks, and speaks in milliseconds—is a massive engineering challenge.

In this guide, we’re breaking down how to build a production-ready Voice Agent using LangChain, handling everything from real-time transcription to tool calling.


🥪 The "Sandwich" Architecture

When building Voice Agents, you have two choices:

  1. Real-Time Models: One giant model that does it all (Audio In → Audio Out). Fast, but rigid.
  2. The Sandwich Method: A modular pipeline where you control every layer.

We are building a Sandwich. Why? Because it gives you the flexibility to swap out "ingredients" (models) and use the latest reasoning engines (like GPT-4o or Claude 3.5) without waiting for a vendor to update their all-in-one model.

Diagram: the five-layer voice agent pipeline. https://res.cloudinary.com/dkdxvobta/image/upload/v1765365858/voice_agent_leognj.png

The 5 Layers of the Stack

To make this work, we need five distinct components working in perfect harmony. The diagram above illustrates the flow of data through these layers, and a code sketch follows the list.

  1. VAD (Voice Activity Detection): The ears. It detects when you stop talking so the agent knows when to reply.
  2. STT (Speech-to-Text): The scribe. Converts your audio stream into text. (e.g., Deepgram, AssemblyAI)
  3. The Brain (LangChain Agent): The logic. It takes the text, decides what to do (like checking inventory), and generates a response.
  4. TTS (Text-to-Speech): The voice. Converts the AI’s text response back into human-sounding audio. (e.g., Cartesia, ElevenLabs)
  5. The Transport: The highway. Usually WebSockets or WebRTC to move data instantly between client and server.
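
To make the stack concrete, here is a minimal TypeScript sketch of those five layers as interfaces. Everything here is illustrative (the names VAD, STT, AgentBrain, and TTS are hypothetical, not a real library's API), but it shows why each "ingredient" stays swappable:

```typescript
// Illustrative interfaces for the five layers. The names and method
// signatures are hypothetical, not a real library's API.

interface VAD {
  // The ears: returns true while the user is speaking, false on silence.
  processFrame(frame: Int16Array): boolean;
}

interface STT {
  // The scribe: streams partial transcripts as audio chunks arrive.
  transcribe(audio: AsyncIterable<Uint8Array>): AsyncIterable<string>;
}

interface AgentBrain {
  // The logic: streams response tokens for a finished transcript.
  respond(transcript: string): AsyncIterable<string>;
}

interface TTS {
  // The voice: streams synthesized audio for a stream of text tokens.
  synthesize(tokens: AsyncIterable<string>): AsyncIterable<Uint8Array>;
}

// The transport (WebSockets or WebRTC) moves the bytes between the
// client and these layers; see the server sketch later in the post.
```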

⚡ The Enemy: Latency

In a normal conversation, humans expect a reply within 250ms to 750ms. If your bot takes 2 seconds to "think," the illusion breaks. It feels sluggish and robotic.

To fight latency, we don't wait for one step to finish before starting the next. We stream everything.

The Event Stream Pattern

Instead of passing full files, we pass a continuous stream of events. Different parts of the system process data in parallel, as the sketch after this list shows.

  • User speaks: Audio bytes stream to the Server.
  • STT: Starts transcribing while the user is still talking.
  • Agent: Receives the transcript stream.
  • TTS: Starts generating audio for the beginning of the sentence before the LLM has finished thinking of the end of the sentence.
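
Here is a rough sketch of the glue code for one conversational turn, reusing the hypothetical interfaces from the layer sketch above. Chained async iterables are what let each stage start before the previous one finishes:

```typescript
// Hypothetical glue code for one conversational turn, using the
// interfaces from the layer sketch above. Each stage consumes the
// previous stage's stream, so downstream work starts early.
async function handleTurn(
  audioIn: AsyncIterable<Uint8Array>, // closed by the VAD on silence
  stt: STT,
  brain: AgentBrain,
  tts: TTS,
  sendAudio: (chunk: Uint8Array) => void,
) {
  // STT transcribes while the user is still talking; we keep the
  // latest (cumulative) partial transcript.
  let transcript = '';
  for await (const partial of stt.transcribe(audioIn)) {
    transcript = partial;
  }

  // The LLM streams tokens and TTS consumes them as they arrive, so
  // the first audio byte ships before the sentence is finished.
  const tokens = brain.respond(transcript);
  for await (const chunk of tts.synthesize(tokens)) {
    sendAudio(chunk); // low Time to First Byte
  }
}
```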

This pipeline approach drastically reduces "Time to First Byte" (TTFB), making the conversation feel snappy and real.


🛠️ The Build: A Sandwich Ordering Bot

In the demo, we built a bot for a sandwich shop using LangChain's create_agent pattern.

1. The Setup (Hono & WebSockets)

We use a lightweight server (like Hono) to handle a WebSocket connection. This connection is the lifeline, carrying audio events back and forth.
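
A minimal version of that server, using Hono's Bun WebSocket helper, might look like this (the /voice route and handler bodies are illustrative, not the demo's exact code):

```typescript
// Minimal Hono WebSocket server on Bun. The "/voice" route and the
// handler bodies are illustrative, not the demo's exact code.
import { Hono } from 'hono';
import { createBunWebSocket } from 'hono/bun';

const { upgradeWebSocket, websocket } = createBunWebSocket();
const app = new Hono();

app.get(
  '/voice',
  upgradeWebSocket(() => ({
    onMessage(event, ws) {
      if (typeof event.data === 'string') {
        // Text frames carry control events (e.g. "user stopped talking").
        console.log('control event:', event.data);
        return;
      }
      // Binary frames are raw audio bytes from the client: feed them
      // into the pipeline, then stream TTS audio back via ws.send(...).
    },
    onClose() {
      console.log('call ended');
    },
  })),
);

// Bun picks up both the HTTP handler and the WebSocket handler here.
export default { fetch: app.fetch, websocket };
```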

2. The Middleware Magic

The secret sauce is the middleware. It sits between your text-based LLM agent and the voice inputs. It handles the messy work of:

  • Buffering audio.
  • Managing interruptions (if the user cuts the bot off).
  • Aggregating streams (a sketch follows below).
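
Here is a sketch of those responsibilities in code. The VoiceMiddleware class is hypothetical, not the library's actual implementation:

```typescript
// Hypothetical middleware sketch: buffering, interruptions, aggregation.
class VoiceMiddleware {
  private buffer: Uint8Array[] = [];
  private botSpeaking = false;
  private abort = new AbortController();

  // Called for every audio chunk arriving from the client.
  onUserAudio(chunk: Uint8Array, userIsSpeaking: boolean) {
    this.buffer.push(chunk); // buffer audio for the STT stream

    // Interruption (barge-in): if the user talks over the bot,
    // cancel the in-flight LLM/TTS work right away.
    if (userIsSpeaking && this.botSpeaking) {
      this.abort.abort();
      this.abort = new AbortController();
      this.botSpeaking = false;
    }
  }

  setBotSpeaking(speaking: boolean) {
    this.botSpeaking = speaking;
  }

  // Hand the aggregated audio to STT and reset the buffer.
  drainBuffer(): Uint8Array[] {
    const chunks = this.buffer;
    this.buffer = [];
    return chunks;
  }

  // Pass this signal into LLM/TTS calls so aborts propagate.
  get signal(): AbortSignal {
    return this.abort.signal;
  }
}
```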

3. Tool Calling

This isn't just a chatbot; it's an Agent. It has tools!

  • add_to_order: Adds items to the user's cart.
  • confirm_order: Finalizes the transaction.

Because we are using the "Sandwich Method," we can use powerful reasoning models that are excellent at strict tool calling, ensuring your bot doesn't accidentally order a "ham sandwich with a side of tires."
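
Wiring those tools up in LangChain JS might look like the following, assuming the v1 createAgent API (the JS counterpart of the create_agent pattern from the demo); the in-memory cart is a stand-in for real state:

```typescript
// Tool definitions for the sandwich bot. tool() with zod schemas is
// standard LangChain JS; the cart logic here is a stand-in.
import { tool } from '@langchain/core/tools';
import { createAgent } from 'langchain';
import { z } from 'zod';

const cart: string[] = [];

const addToOrder = tool(
  async ({ item, modifications }) => {
    cart.push(modifications ? `${item} (${modifications})` : item);
    return `Added ${item} to the order.`;
  },
  {
    name: 'add_to_order',
    description: "Add a sandwich or side to the customer's cart.",
    schema: z.object({
      item: z.string().describe('Menu item, e.g. "roast beef with swiss"'),
      modifications: z.string().optional().describe('e.g. "hold the mayo"'),
    }),
  },
);

const confirmOrder = tool(
  async () => `Order confirmed: ${cart.join(', ')}.`,
  {
    name: 'confirm_order',
    description: 'Finalize the transaction once the customer is done.',
    schema: z.object({}),
  },
);

const agent = createAgent({
  model: 'openai:gpt-4o', // swappable: the whole point of the sandwich
  tools: [addToOrder, confirmOrder],
});

// The voice middleware feeds each finished transcript in as a user message:
// await agent.invoke({ messages: [{ role: 'user', content: 'Roast beef, hold the mayo' }] });
```

Because the agent is just another layer in the sandwich, the model string is the only thing you touch to swap reasoning engines.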


🚀 Why This Matters

Voice Agents are moving beyond simple call-confirmation trees. We are entering an era of:

  • Rich Customer Support: Agents that can actually fix problems in your account.
  • Complex Commerce: Ordering complex products with natural language.
  • Companionship: AI that remembers context and understands nuance.

By building with a modular LangChain architecture, you future-proof your application. When a better Text-to-Speech model comes out next week? You just swap the ingredient.

Ready to Build?

Check out the reference code to see the Event Stream in action. It’s time to give your code a voice.


Start building today and let the world hear what you've created!
