Definition

What is a voice agent?

Last updated

Definition

A voice agent is an AI agent that interacts over a phone or voice channel — taking spoken input through speech-to-text, reasoning with an LLM, and responding through text-to-speech, typically with sub-second latency.

Voice agents combine four components: a speech-to-text (STT) model that transcribes the caller, a large language model that reasons and decides what to say, a text-to-speech (TTS) model that voices the response, and a real-time transport layer (LiveKit, WebRTC, SIP) that handles the audio. Production voice agents add tool calls, conversation memory, DND lists, recording for compliance, and human handoff. Use cases span COD-confirm calls, AI receptionists, appointment confirmations, and outbound surveys.

Latency is the headline metric

Humans expect sub-second turn-taking on a phone call. Anything above ~700ms feels broken. That constraint shapes every architectural choice — model selection (smaller, faster), STT streaming vs batch, TTS streaming, transport (WebRTC vs polled HTTP), and where the components run (same region, ideally same data center).

Indian-language voice

The English voice-agent market is well-served. Indian-language voice (Hindi, Punjabi, Tamil, Telugu, Bengali, Marathi, Gujarati) is much thinner — most generic STT models degrade significantly on regional accents and code-switching. Sarvam was built specifically for Indian languages and is the default STT in Glitch Grow’s Voice AI Agent boilerplate.

Cost structure

Voice-agent cost decomposes into per-minute STT, per-token LLM, per-character TTS, plus telephony. A well-tuned stack hits ~$0.02/min raw infra cost. Managed platforms layer ~$0.05–$0.10/min in markup on top.

Related terms

Related agents

Sources

Free Vibe Coder Kit

Get the kit. Ship like a vibe coder.

Installs into Claude Code, Codex, or OpenClaws in under a minute. Required to deploy our paid agents.

Protected by Cloudflare Turnstile. We never share your details. Unsubscribe any time.