When ChatGPT launched, it unlocked the power of frontier AI models in an interface familiar to everyone. Chat adoption spread like wildfire, and one underappreciated reason it scaled so fast is that it was convenient to build and deploy.
A chatbot is essentially an HTTP endpoint. Setting aside the enormous complexity of training and serving the models themselves, the applications built on top of them run on the same stateless, request–response stack that has powered web applications for decades. Engineers did not need to learn anything new at the application layer, and operations teams did not need to support unfamiliar patterns. AI applications exploded because an intuitive interface met infrastructure that was already everywhere.
Voice AI was the first use case where that stopped being true. In doing so, it exposed the limits of the infrastructure AI has quietly been relying on.
Why Voice Broke the Pattern
A voice agent is a persistent process: a small program that starts when a conversation begins and keeps running until the conversation ends. While it is alive, it continuously processes incoming audio, runs it through speech recognition, generates responses through a language model, and converts those responses back into audio in real time.
It holds conversation state in memory and coordinates multiple models running concurrently. You could run one on your laptop, talk to it from your phone, and watch the process sit there the whole time—just a program, running.
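That loop can be sketched as a single Python process. This is an illustrative skeleton, not a real implementation: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for actual speech-to-text, language-model, and text-to-speech calls.

```python
import queue

# Hypothetical stand-ins for real models (assumptions, not real APIs).
def transcribe(audio_chunk: bytes) -> str:          # speech-to-text
    return audio_chunk.decode()

def generate_reply(history: list[str], text: str) -> str:  # language model
    return f"reply to: {text}"

def synthesize(text: str) -> bytes:                 # text-to-speech
    return text.encode()

def run_session(incoming: "queue.Queue[bytes | None]") -> list[bytes]:
    """One voice session: a process that lives for the whole conversation,
    holding its state (the transcript) in ordinary local variables."""
    history: list[str] = []      # in-memory conversation state
    outgoing: list[bytes] = []
    while True:
        chunk = incoming.get()   # blocks until the caller speaks again
        if chunk is None:        # caller hung up: the process ends
            break
        text = transcribe(chunk)
        reply = generate_reply(history, text)
        history.extend([text, reply])
        outgoing.append(synthesize(reply))
    return outgoing
```

The point of the sketch is the shape, not the stubs: the process starts, loops for as long as the conversation lasts, and carries its state with it the entire time.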
This is a clear break from how backend software has worked for the past twenty years.
The Stateless Web Stack Meets its Limits
Over the last two decades, the industry converged on a familiar set of best practices: microservices and stateless request handlers behind load balancers. Each request is independent and can be handled by any server. You cannot rely on in-process state because the next request might land somewhere else.
Everything is designed to scale horizontally: add more instances when traffic spikes, scale them back down when it drops. This model won because it works extremely well for web applications, and it became the de facto architecture of nearly every major application.
A voice agent does not fit this model at all. Each session is its own long-running process with its own state. If you store something in a variable at the start of the conversation, it is still there ten minutes later. The protocols are different: you need WebRTC and real-time media transport, not just REST APIs. The scaling assumptions are different: you cannot treat these processes as interchangeable and stateless, because they are not. And the performance demands are punishing—users expect sub-second responsiveness while the system handles interruptions, background events, and multiple concurrent audio streams.
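The contrast can be made concrete with a toy example, assuming a trivial piece of state (a running turn count) standing in for real conversation state:

```python
# Stateless handler: every request must carry (or look up) its own context,
# because the next request may land on a different instance.
def handle_request(message: str, context: dict) -> dict:
    return {**context, "turns": context.get("turns", 0) + 1}
    # state lives outside the process, e.g. with the client or in a database

# Stateful session: state lives in ordinary instance variables for minutes
# at a time, so the session is pinned to one process and the processes are
# not interchangeable.
class VoiceSession:
    def __init__(self) -> None:
        self.turns = 0           # still here ten minutes into the call

    def on_audio(self, message: str) -> int:
        self.turns += 1
        return self.turns
```

In the first style, any replica can serve the next request; in the second, killing or rescheduling the process destroys the conversation. That single difference invalidates most of the horizontal-scaling playbook.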
Most infrastructure teams have deep expertise in the HTTP world and almost none in this one.
Voice Teams Had to Reinvent the Stack
Teams building voice systems therefore had to solve hard problems from scratch: deploying and scaling long-lived, stateful processes; routing real-time media with minimal latency; and observing what is happening inside a session that unfolds over minutes rather than a transaction that completes in milliseconds.
For a while, this looked like a niche problem specific to voice. It is not anymore. Voice did not just introduce new requirements—it revealed where the existing stateless model breaks down.
Agents Are Becoming Long-Lived Processes
Look at the agents that are breaking out right now. Claude Code runs as a long-lived process in your terminal. OpenClaw runs continuously on your machine, maintaining state across sessions. OpenAI’s Codex spins up dedicated sandbox environments for each task.
Even conventional chat assistants have outgrown the simple request–response model. They reason for extended periods, call tools, spin up virtual machines to execute code, and let users switch from typing to talking mid-conversation. Across the board, the agent is no longer just a function you call—it is a process that runs.
Many of these systems still run locally. Not because local is the ideal end state, but because the cloud infrastructure for running persistent, stateful agents at scale is still incomplete.
The Infrastructure Gap for Persistent Agents
Running a process on your own machine is easy. Running a fleet of millions of them in the cloud—each stateful, each long-lived, each handling concurrent inputs and delivering real-time responsiveness—is an unsolved problem for most infrastructure teams.
Voice is one of the few domains where teams have already been forced to figure this out. They had no choice: real-time, conversational experiences break if latency spikes, state disappears, or sessions are moved arbitrarily between machines. As more categories of agents become persistent, that hard-won expertise becomes a blueprint, not just for voice, but for the broader agent ecosystem.
What Enterprise Leaders Should Take Away
Chat won because it fit neatly into infrastructure that was already everywhere. Voice shows what happens when AI does not.
Persistent agents need infrastructure that treats them as processes rather than as isolated HTTP requests: purpose-built runtimes for long-lived workloads, real-time coordination across components, and observability that is session-aware instead of purely request-based. Voice was simply the first large-scale use case that forced teams to build all of this from scratch.
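A rough sketch of what session-aware observability means in practice: events keyed by a session identifier and aggregated over the session's lifetime rather than per request. The `SessionTracer` name and API below are illustrative assumptions, not any real library.

```python
import time
from collections import defaultdict

class SessionTracer:
    """Groups events by session_id, so a five-minute conversation is
    observed as one unit rather than thousands of unrelated requests."""

    def __init__(self) -> None:
        self.events: dict[str, list[tuple[float, str]]] = defaultdict(list)

    def record(self, session_id: str, event: str) -> None:
        # Timestamp each event so per-session timelines can be reconstructed.
        self.events[session_id].append((time.monotonic(), event))

    def session_duration(self, session_id: str) -> float:
        ts = [t for t, _ in self.events[session_id]]
        return max(ts) - min(ts) if len(ts) > 1 else 0.0
```

Request-scoped tooling answers "how long did this call take?"; session-scoped tooling answers "what happened across this conversation?", and that is the question persistent agents force you to ask.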
For enterprise and infrastructure leaders, the implication is not “bet everything on voice,” but something more general: as agents move from stateless functions to persistent collaborators, your architecture, tooling, and operating model will need to evolve with them. Systems built for short-lived, stateless traffic will keep delivering value, but they will increasingly sit alongside a new class of infrastructure designed for always-on, stateful agents.
Voice showed that this shift is possible—and difficult. The next wave of agentic applications will determine how quickly the rest of the stack catches up.