Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

📅 2026-03-05
🤖 AI Summary
This work addresses the absence of open-source, end-to-end streaming voice agent implementations tailored for enterprise applications. We propose a real-time voice agent system built upon a cascaded architecture of streaming speech-to-text (STT), a large language model (LLM) with function calling support, and text-to-speech (TTS), emphasizing pipeline-level coordination across components rather than reliance on a single high-performance model. The system integrates Deepgram, vLLM (self-hosted on an NVIDIA A10G), and ElevenLabs, achieving a median time-to-first-audio latency of 947 ms (best case: 729 ms). To our knowledge, this is the first open-source, fully reproducible, enterprise-grade implementation of an end-to-end real-time voice agent, demonstrating that streaming interaction performance hinges critically on inter-component pipeline design.

📝 Abstract
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction ($\sim$13s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline: STT $\rightarrow$ LLM $\rightarrow$ TTS, where each component streams its output to the next; and (3) the key to ``realtime'' is not any single fast model but rather \textit{streaming and pipelining} across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947ms (best case 729ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on an NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.
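The cascade described in the abstract can be sketched with async generators, where each stage begins emitting output before its upstream stage finishes. This is a minimal illustration, not the paper's implementation: the three stage functions below are simulated stand-ins (assumptions) for the real Deepgram, vLLM, and ElevenLabs clients, and `CHUNK_DELAY` is an arbitrary per-chunk processing time.

```python
import asyncio
import time

CHUNK_DELAY = 0.05  # simulated per-chunk processing time (seconds)

async def stt_stream(audio_chunks):
    # Stand-in for streaming STT: emit one transcript word per audio chunk.
    for i, _chunk in enumerate(audio_chunks):
        await asyncio.sleep(CHUNK_DELAY)
        yield f"word{i}"

async def llm_stream(words):
    # Stand-in for a streaming LLM: emit one response token per input word.
    async for w in words:
        await asyncio.sleep(CHUNK_DELAY)
        yield f"token-for-{w}"

async def tts_stream(tokens):
    # Stand-in for streaming TTS: emit one audio frame per token.
    async for t in tokens:
        await asyncio.sleep(CHUNK_DELAY)
        yield f"audio({t})"

async def time_to_first_audio(n_chunks):
    # Measure latency until the first audio frame leaves the cascade.
    start = time.monotonic()
    audio_in = list(range(n_chunks))
    async for _frame in tts_stream(llm_stream(stt_stream(audio_in))):
        return time.monotonic() - start

async def main():
    ttfa = await time_to_first_audio(10)
    # A non-streaming cascade would wait for every stage to fully finish:
    sequential = 3 * 10 * CHUNK_DELAY
    print(f"pipelined TTFA: {ttfa:.2f}s vs sequential: {sequential:.2f}s")

asyncio.run(main())
```

Because the stages overlap, time-to-first-audio here is roughly three chunk delays (one per stage) rather than the full sequential runtime, which is the pipelining effect the paper identifies as the key to realtime behavior.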

Problem

Research questions and friction points this paper is trying to address.

realtime voice agents
streaming pipeline
speech-to-speech
enterprise AI
function calling

Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming pipeline
realtime voice agent
function calling
speech-to-speech
low-latency inference