🤖 AI Summary
This work addresses the challenge of enabling natural, proactive, and emotionally expressive real-time interaction in speech AI agents. We propose an end-to-end speech–language foundation model. Methodologically, we introduce a novel hierarchical multi-scale Transformer architecture that unifies speech perception, language understanding, and affective speech generation, and we integrate a full-duplex streaming encoder, lightweight acoustic adapters, and prompt-driven persona control to jointly optimize ASR, TTS, and speech translation. Experiments show an end-to-end response latency of only 195 ms, below the average human response time, alongside a 22% reduction in ASR WER and a TTS MOS of 4.3. The model supports over 100 languages and more than one million pre-trained voices, and can synthesize custom voices from audio samples as short as 10 seconds. Fully open-sourced, it establishes a new paradigm for embodied, autonomous, and empathetic speech agents.
📝 Abstract
A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that takes a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, faster than the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation: users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), text-to-speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.
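To make the persona-control idea concrete, here is a minimal sketch of composing a text instruction that defines a speaker's identity and tone. Voila's actual prompt template is specified in its open-source release; the function name, fields, and wording below are illustrative assumptions, not the model's real API.

```python
# Hypothetical sketch of prompt-driven persona control.
# The template and field names are assumptions for illustration;
# consult the open-source Voila release for the real prompt format.

def build_persona_prompt(name: str, tone: str, traits: list[str]) -> str:
    """Compose a plain-text instruction defining a speaker persona."""
    trait_str = ", ".join(traits)
    return (
        f"You are {name}. Speak in a {tone} tone. "
        f"Character traits: {trait_str}."
    )

# Example: a warm, patient assistant persona defined purely in text.
prompt = build_persona_prompt(
    name="Ava",
    tone="warm, conversational",
    traits=["patient", "curious", "gently humorous"],
)
print(prompt)
```

The point of the sketch is that no audio reference or fine-tuning is needed to steer identity and tone; a short written instruction is the control surface.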