Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the reasoning capabilities of speech-based interactive systems under real-time dialogue constraints, revealing a substantial performance gap between speech and text modalities. To address this, we introduce VERA—the first benchmark explicitly designed for native speech reasoning—comprising 2,931 authentic spoken dialogues spanning five task categories: mathematics, web navigation, scientific reasoning, long-context understanding, and factual recall. VERA enables cross-modal comparison and architectural analysis. We evaluate 12 state-of-the-art systems using joint latency-accuracy assessment, cascade-decoupled modeling, and fine-grained error diagnostics. Results show that the best text-based model achieves 54.0% average accuracy, while speech-based systems attain only 11.3%; the gap reaches 68.7 percentage points on mathematical reasoning. This study provides the first empirical evidence of accuracy stagnation in low-latency speech reasoning and identifies fundamental limitations in current decoupled (ASR → LLM → TTS) architectures.

Technology Category

Application Category

📝 Abstract
We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning ability in voice-interactive systems under real-time constraints
Addressing large performance gaps between text and voice modalities in AI systems
Diagnosing why common mitigations fail to bridge text-voice reasoning performance gaps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voice-native benchmark for real-time reasoning evaluation
Direct text-voice comparison within model architectures
Diagnostic framework for thinking-speaking decoupled systems
🔎 Similar Papers
Yueqian Lin
Yueqian Lin
PhD Student, Duke University
Zhengmian Hu
Zhengmian Hu
Adobe Research
Deep LearningMonte Carlo
Qinsi Wang
Qinsi Wang
Duke University
Efficiency LLMModel Accelerate
Y
Yudong Liu
Duke University, Durham, NC, USA
H
Hengfan Zhang
Duke University, Durham, NC, USA
Jayakumar Subramanian
Jayakumar Subramanian
Senior Research Scientist, Adobe India
Agent based modelsReinforcement LearningMulti-agent Reinforcement LearningGame theoryDeep Learning
N
Nikos Vlassis
Adobe, San Jose, CA, USA
H
Hai Helen Li
Duke University, Durham, NC, USA
Y
Yiran Chen
Duke University, Durham, NC, USA