
Give Your Code a Voice: Building Real-Time AI Voice Agents with LangChain 🎙️
Master the art of building production-ready AI Voice Agents in this step-by-step LangChain tutorial.

Imagine calling a sandwich shop. You say, "I want a roast beef with swiss, hold the mayo." The voice on the other end replies instantly, confirms your order, and even jokes about your excellent cheese choice.
Now, imagine that "person" is code you wrote.
Voice AI is the new frontier. But building an agent that feels human—one that listens, thinks, and speaks in milliseconds—is a massive engineering challenge.
In this guide, we’re breaking down how to build a production-ready Voice Agent using LangChain, handling everything from real-time transcription to tool calling.
When building Voice Agents, you have two choices:
- The all-in-one approach: a single speech-to-speech model that listens, reasons, and talks inside one black box.
- The "Sandwich" approach: a modular pipeline that layers speech-to-text, a text-based LLM agent, and text-to-speech.
We are building a Sandwich. Why? Because it gives you the flexibility to swap out "ingredients" (models) and use the latest reasoning engines (like GPT-4o or Claude 3.5) without waiting for a vendor to update their all-in-one model.
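To make that swap-ability concrete, here is a minimal TypeScript sketch of the Sandwich as three interchangeable layers. The interface and function names are illustrative, not taken from any particular library.

```ts
// Minimal sketch of the "Sandwich" architecture. Each layer is an
// interchangeable interface, so upgrading one model never touches the others.
// All names here are illustrative, not from a specific library.

interface SpeechToText {
  // Turns a stream of raw audio chunks into a stream of transcript fragments.
  transcribe(audio: AsyncIterable<Uint8Array>): AsyncIterable<string>;
}

interface ReasoningAgent {
  // Turns a transcript into a stream of reply tokens (e.g. GPT-4o, Claude 3.5).
  respond(transcript: string): AsyncIterable<string>;
}

interface TextToSpeech {
  // Turns reply tokens back into a stream of playable audio chunks.
  synthesize(text: AsyncIterable<string>): AsyncIterable<Uint8Array>;
}

// The pipeline only knows the interfaces, so any "ingredient" can be swapped.
async function* voicePipeline(
  audio: AsyncIterable<Uint8Array>,
  stt: SpeechToText,
  agent: ReasoningAgent,
  tts: TextToSpeech,
): AsyncIterable<Uint8Array> {
  for await (const utterance of stt.transcribe(audio)) {
    yield* tts.synthesize(agent.respond(utterance));
  }
}
```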

To make this work, we need five distinct components working in perfect harmony: a transport server, speech-to-text, the LLM agent, text-to-speech, and the middleware that glues them together. The diagram below illustrates the flow of data through these layers.
In a normal conversation, humans expect a reply within 250ms to 750ms. If your bot takes 2 seconds to "think," the illusion breaks. It feels sluggish and robotic.
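To see why that window is so hard to hit, here is a rough latency budget. Every number below is an assumption chosen for the sake of arithmetic, not a measured benchmark.

```ts
// Illustrative latency budget: every figure is an assumption, not a benchmark.
const budgetMs = {
  vadAndTransport: 100, // detect that the caller stopped speaking + network hop
  sttFinalize: 100,     // finalize the transcript of the last utterance
  llmFirstToken: 200,   // time to first token from the reasoning model
  ttsFirstAudio: 150,   // time to first synthesized audio chunk
};

const total = Object.values(budgetMs).reduce((sum, ms) => sum + ms, 0);
console.log(`Time to first audible reply: ~${total} ms`); // ~550 ms

// These are time-to-first-output figures. Wait for each stage to fully
// finish before starting the next, and the reply easily takes 2+ seconds.
```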
To fight latency, we don't wait for one step to finish before starting the next. We stream everything.
Instead of passing full files, we pass a continuous stream of events. The diagram below visualizes how different parts of the system process data in parallel.
This pipeline approach drastically reduces "Time to First Byte" (TTFB), making the conversation feel snappy and real.
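For a sense of what "a stream of events" means in practice, here is a sketch. The event names are invented for illustration; real STT and TTS providers each define their own vocabulary.

```ts
// Invented event vocabulary for this sketch; real providers differ.
type VoiceEvent =
  | { type: "audio_chunk"; data: Uint8Array }   // raw caller audio, tens of ms per chunk
  | { type: "transcript_delta"; text: string }  // partial STT output, refined as speech continues
  | { type: "agent_token"; text: string }       // LLM reply, one token at a time
  | { type: "speech_chunk"; data: Uint8Array }; // synthesized audio, playable immediately

declare function bufferForSpeech(text: string): void; // hypothetical: flushes to TTS at sentence breaks
declare function playAudio(chunk: Uint8Array): void;  // hypothetical: pushes audio to the caller

// Each stage reacts to events as they arrive, so the TTS can start speaking
// sentence one while the LLM is still writing sentence two.
function onEvent(event: VoiceEvent) {
  switch (event.type) {
    case "agent_token":
      bufferForSpeech(event.text);
      break;
    case "speech_chunk":
      playAudio(event.data); // playback begins long before the reply is complete
      break;
    // audio_chunk and transcript_delta flow onward to STT and the agent
  }
}
```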
In the demo, we built a bot for a sandwich shop using LangChain's create_agent pattern.
We use a lightweight server (like Hono) to handle a WebSocket connection. This connection is the lifeline, carrying audio events back and forth.
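Here is a minimal sketch of that server, assuming Hono on the Bun runtime (other runtimes use different WebSocket helpers). The handleVoiceEvent function is a hypothetical stand-in for the streaming pipeline above.

```ts
// Minimal sketch of the WebSocket lifeline, assuming Hono running on Bun.
import { Hono } from "hono";
import { createBunWebSocket } from "hono/bun";

const { upgradeWebSocket, websocket } = createBunWebSocket();
const app = new Hono();

// Hypothetical: feed incoming frames to STT -> agent -> TTS, stream replies back.
function handleVoiceEvent(data: unknown, send: (reply: string | Uint8Array) => void) {
  // ... route binary audio chunks and JSON control events into the pipeline
}

app.get(
  "/voice",
  upgradeWebSocket(() => ({
    onOpen() {
      console.log("Caller connected");
    },
    onMessage(event, ws) {
      // Each frame is either a binary audio chunk or a JSON control event.
      handleVoiceEvent(event.data, (reply) => ws.send(reply));
    },
    onClose() {
      console.log("Caller hung up");
    },
  })),
);

// Bun wires the WebSocket handler in alongside regular HTTP fetch.
export default { fetch: app.fetch, websocket };
```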
The secret sauce is the middleware. It sits between your text-based LLM agent and the voice inputs. It handles the messy work of:
- Transcribing the caller's audio into text the agent can read (speech-to-text).
- Converting the agent's streamed text replies back into audio (text-to-speech).
- Keeping the event stream flowing so speech starts playing before the full reply is generated.
This isn't just a chatbot; it's an Agent. It has tools!
- add_to_order: Adds items to the user's cart.
- confirm_order: Finalizes the transaction.

Because we are using the "Sandwich Method," we can use powerful reasoning models that are excellent at strict tool calling, ensuring your bot doesn't accidentally order a "ham sandwich with a side of tires."
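Putting the agent and its two tools together, here is a minimal sketch, assuming LangChain JS v1 (where create_agent appears as createAgent) and an OpenAI API key in the environment. The tool bodies are stubs.

```ts
// Minimal sketch of the sandwich-shop agent, assuming LangChain JS v1.
import { createAgent, tool } from "langchain";
import { z } from "zod";

const cart: string[] = [];

const addToOrder = tool(
  async ({ item }) => {
    cart.push(item);
    return `Added ${item}. Cart so far: ${cart.join(", ")}`;
  },
  {
    name: "add_to_order",
    description: "Add a sandwich or side to the user's cart.",
    schema: z.object({ item: z.string().describe("The menu item to add") }),
  },
);

const confirmOrder = tool(
  async () => `Order confirmed: ${cart.join(", ")}`,
  {
    name: "confirm_order",
    description: "Finalize the transaction once the user agrees.",
    schema: z.object({}),
  },
);

const agent = createAgent({
  model: "openai:gpt-4o",
  tools: [addToOrder, confirmOrder],
});

const result = await agent.invoke({
  messages: [{ role: "user", content: "Roast beef with swiss, hold the mayo." }],
});
console.log(result.messages.at(-1)?.content);
```

In the full demo, the middleware feeds transcribed utterances into this agent and streams its replies into text-to-speech.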
Voice Agents are moving beyond simple "Call confirm" trees; we are entering an era of genuinely conversational agents.
By building with a modular LangChain architecture, you future-proof your application. A better Text-to-Speech model comes out next week? You just swap the ingredient.
Check out the reference code to see the Event Stream in action. It’s time to give your code a voice.
Start building today and let the world hear what you've created!