🤖 AI Summary
Existing evaluation methods struggle to assess voice agents’ dialogue dynamics, task completion, and full-duplex interaction simultaneously within realistic, complex scenarios. This work proposes τ-voice, the first benchmark to integrate verifiable task success, full-duplex spoken interaction, and realistic audio conditions. It builds a reproducible and ecologically valid evaluation framework by combining grounded multi-turn tasks, domain-specific policy constraints, and a controllable spoken user simulator that models diverse accents, background noise, and natural turn-taking. Evaluated on an extended τ²-bench comprising 278 tasks, voice agents achieve only 31–51% success rates in clean conditions, substantially below GPT-5’s 85% text-based performance, and 79–90% of failures are attributable to agent behavior, revealing a stark gap between current spoken dialogue systems and advanced text-based reasoning.
📝 Abstract
Full-duplex voice agents, systems that listen and speak simultaneously, are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce $τ$-voice, a benchmark for evaluating voice agents on grounded tasks with real-world complexity: agents must navigate complex multi-turn conversations, adhere to domain policies, and interact with the environment. The framework extends $τ^2$-bench into a voice agent benchmark that combines verifiable completion of complex grounded tasks, full-duplex interaction, and realistic audio, enabling direct comparison between voice and text performance. A controllable voice user simulator provides diverse accents, realistic audio environments, and rich turn-taking dynamics; because simulation is decoupled from wall-clock time, the simulator can use the most capable LLM available without real-time constraints. We evaluate task completion (pass@1) and voice interaction quality across 278 tasks: while GPT-5 (reasoning) achieves 85%, voice agents reach only 31–51% under clean conditions and 26–38% under realistic conditions with noise and diverse accents, retaining only 30–45% of text capability; qualitative analysis attributes 79–90% of failures to agent behavior, suggesting that the observed gap primarily reflects agent limitations under our evaluation setup. $τ$-voice provides a reproducible testbed for measuring progress toward voice agents that are natural, conversational, and reliable.
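For readers checking the headline numbers, a minimal sketch of the two quantities involved, assuming the standard single-attempt pass@1 definition ("retention" is our label for the voice-to-text comparison, not terminology confirmed by the paper):

$$
\text{pass@1} \;=\; \frac{\#\{\text{tasks solved on the first attempt}\}}{\#\{\text{tasks}\}},
\qquad
\text{retention} \;=\; \frac{\text{pass@1}_{\text{voice}}}{\text{pass@1}_{\text{text}}}.
$$

Plugging in the reported figures under realistic conditions, $26/85 \approx 0.31$ and $38/85 \approx 0.45$, which matches the quoted 30–45% retention relative to GPT-5's 85% text performance.

The wall-clock decoupling can be pictured as a discrete-event loop over logical audio time: the simulator advances a virtual clock per audio chunk, so an arbitrarily slow LLM call in the user simulator never distorts simulated turn-taking. Below is a minimal sketch of that idea; the names (`VirtualClock`, `next_chunk`, `hear`) and the 20 ms chunk size are hypothetical illustrations, not details from the paper:

```python
# Hypothetical sketch: full-duplex simulation on a logical clock, so slow
# LLM calls in the user simulator do not affect simulated turn-taking timing.
import dataclasses

CHUNK_SEC = 0.02  # 20 ms of audio per simulation step (assumed granularity)

@dataclasses.dataclass
class VirtualClock:
    t: float = 0.0  # logical time in seconds, independent of wall-clock time

    def advance(self) -> None:
        self.t += CHUNK_SEC  # time moves per chunk, not per real second elapsed

def simulate_dialogue(agent, user_sim, max_sec: float = 300.0) -> None:
    """Run agent and simulated user in lockstep over logical time."""
    clock = VirtualClock()
    while clock.t < max_sec:
        # Full duplex: both sides emit audio for this chunk simultaneously;
        # neither waits for the other to finish a turn.
        agent_chunk = agent.next_chunk(clock.t)    # may be silence
        user_chunk = user_sim.next_chunk(clock.t)  # may block on an LLM call
        agent.hear(user_chunk, clock.t)
        user_sim.hear(agent_chunk, clock.t)
        clock.advance()  # however long the LLM took, logical time moves 20 ms
```

Because timing is bookkept in logical seconds, overlaps, interruptions, and response latencies stay reproducible regardless of how long each underlying model call actually takes.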