🤖 AI Summary
This work addresses the challenge of enabling natural, proactive, and emotionally expressive real-time interaction in speech AI agents. We propose an end-to-end speech–language foundation model. Methodologically, we introduce a novel hierarchical multi-scale Transformer architecture that unifies speech perception, language understanding, and affective speech generation, and we integrate a full-duplex streaming encoder, lightweight acoustic adapters, and prompt-driven persona control to jointly optimize ASR, TTS, and speech translation. Experiments show an end-to-end response latency of only 195 ms, below the average human response time, alongside a 22% reduction in ASR WER and a TTS MOS of 4.3. The model supports over 100 languages and more than one million pre-trained voices, and can synthesize custom voices from audio samples as short as 10 seconds. Fully open-sourced, it establishes a new paradigm for embodied, autonomous, and empathetic speech agents.
📝 Abstract
A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that takes a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, faster than the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation: users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), text-to-speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.
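To make the persona-control idea concrete, here is a minimal sketch of composing a text instruction that defines a speaker's identity and tone. Voila's actual prompt template is specified in its open-source release; the function name, fields, and wording below are illustrative assumptions, not the model's real API.

```python
# Hypothetical sketch of prompt-driven persona control.
# The template and field names are assumptions for illustration;
# consult the open-source Voila release for the real prompt format.

def build_persona_prompt(name: str, tone: str, traits: list[str]) -> str:
    """Compose a plain-text instruction defining a speaker persona."""
    trait_str = ", ".join(traits)
    return (
        f"You are {name}. Speak in a {tone} tone. "
        f"Character traits: {trait_str}."
    )

# Example: a warm, patient assistant persona defined purely in text.
prompt = build_persona_prompt(
    name="Ava",
    tone="warm, conversational",
    traits=["patient", "curious", "gently humorous"],
)
print(prompt)
```

The point of the sketch is that no audio reference or fine-tuning is needed to steer identity and tone; a short written instruction is the control surface.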