🤖 AI Summary
Existing benchmarks for voice assistants inadequately evaluate models' joint understanding of fine-grained paralinguistic features, such as pitch, emotion, timbre, loudness, and ambient sounds, together with visual cues; in particular, they offer little support for assessing implicit multimodal context modeling. To address this gap, we introduce MultiVox, the first benchmark designed specifically for multimodal voice assistants. It comprises 1,000 human-annotated and recorded speech-image/video dialogues that explicitly capture the alignment between acoustic, paralinguistic, and visual signals. The benchmark emphasizes context-aware response generation and establishes a standardized evaluation protocol. Experiments on nine state-of-the-art models reveal substantial gaps relative to human performance, especially in multimodal contextual fusion and response quality. These results underscore MultiVox's role in advancing the research and development of next-generation multimodal voice assistants.
📝 Abstract
The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly their implicit understanding of fine-grained speech characteristics, such as pitch, emotion, timbre, and volume, or of the environmental acoustic context, such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals when forming their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues, including paralinguistic speech features, for truly multimodal understanding. Specifically, MultiVox includes 1,000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features, paired with a range of visual cues such as images and videos. Our evaluation of nine state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
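To make the setup concrete, the sketch below shows one plausible way a MultiVox-style example and its evaluation loop could be organized. All field names, the `model.respond` call, and the `judge.score` interface are hypothetical illustrations; the abstract does not specify the benchmark's actual data schema or scoring protocol.

```python
# A minimal sketch of a MultiVox-style example record and evaluation loop.
# Every field name and interface here is an assumption, not the paper's schema.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MultiVoxExample:
    """One human-annotated dialogue pairing recorded speech with visual context."""
    example_id: str
    audio_path: str                       # recorded spoken query
    visual_path: Optional[str] = None     # accompanying image or video clip
    paralinguistic_tags: list[str] = field(default_factory=list)  # e.g. ["whisper", "sad"]
    reference_response: str = ""          # human-written context-aware response


def evaluate(model, examples: list[MultiVoxExample], judge) -> float:
    """Average a judge's scores of model responses against human references.

    `model` and `judge` are assumed interfaces: the model consumes raw audio
    plus an optional visual input, and the judge rates how contextually
    grounded each response is on a scale normalized to [0, 1].
    """
    scores = []
    for ex in examples:
        response = model.respond(audio=ex.audio_path, visual=ex.visual_path)
        scores.append(judge.score(response, ex.reference_response))
    return sum(scores) / len(scores)
```

Keeping the paralinguistic annotations as explicit tags, as in this sketch, is one way a benchmark could check whether a model's response reflects cues (e.g., a whispered or distressed voice) that are never stated in the transcript itself.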