VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating audio-language models suffer from three key limitations: English-centric design, overreliance on synthetic speech, and the absence of multidimensional, fine-grained assessment. To address these gaps, the authors introduce VCB Bench, a high-quality benchmark built on real human speech and tailored to Chinese spoken dialogue evaluation. It systematically assesses models along three core dimensions: instruction following, knowledge comprehension, and system robustness. Methodologically, it introduces speech-level control mechanisms, realistic contextual perturbations (e.g., background noise, disfluencies), and cross-speaker stability testing. Unlike conventional single-dimension or synthetic-data benchmarks, VCB Bench enables reproducible, fine-grained performance analysis. Empirical evaluation of state-of-the-art large audio language models reveals substantial capability disparities, particularly in robustness and contextual grounding, highlighting critical bottlenecks. VCB Bench thus provides foundational infrastructure and practical guidelines for the standardized evaluation and targeted improvement of Chinese spoken dialogue systems.
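The robustness protocol summarized above (perturbing content, environment, and speaker while holding meaning fixed) can be made concrete as a simple consistency check. The Python sketch below is illustrative only: the `PerturbedItem` structure and the `run_model` callable are assumptions for this example, not part of the VCB Bench release.

```python
# Hypothetical sketch of a perturbation-consistency check: score how stable a
# model's answer stays when the same utterance is re-recorded with background
# noise, disfluencies, or a different speaker. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PerturbedItem:
    question_id: str
    clean_audio: str             # path to the clean human recording
    perturbed_audios: List[str]  # same content under noise / disfluency / new speaker
    reference: str               # gold answer, usable for separate correctness checks


def robustness_score(items: List[PerturbedItem],
                     run_model: Callable[[str], str]) -> float:
    """Fraction of perturbed variants whose answer matches the clean-audio answer."""
    consistent, total = 0, 0
    for item in items:
        clean_answer = run_model(item.clean_audio).strip().lower()
        for audio in item.perturbed_audios:
            total += 1
            if run_model(audio).strip().lower() == clean_answer:
                consistent += 1
    return consistent / total if total else 0.0
```

Exact string match is the crudest possible comparison; a real harness would likely use a judge model or task-specific scoring, but the clean-versus-perturbed pairing is the core of the idea.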

📝 Abstract
Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited -- they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) -- a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in existing audio-grounded conversational agent benchmarks
Evaluating multimodal models across instruction following and knowledge understanding
Providing a robust assessment framework for Chinese voice conversational systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark uses real human speech data
Evaluates models across three complementary perspectives
Provides a reproducible, fine-grained evaluation framework (see the sketch after this list)
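As referenced in the last bullet, the "reproducible, fine-grained" aspect can be illustrated by a minimal per-dimension aggregation. The dimension tags and the (dimension, score) input shape below are assumptions made for illustration; the benchmark's actual harness may differ.

```python
# Minimal sketch of a per-dimension report, assuming each evaluated example is
# pre-tagged with one of the three VCB Bench perspectives. Tag names and the
# (dimension, score) input shape are assumptions, not the benchmark's API.
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

DIMENSIONS = ("instruction_following", "knowledge_understanding", "robustness")


def per_dimension_report(results: List[Tuple[str, float]]) -> Dict[str, float]:
    """results: (dimension_tag, example_score in [0, 1]) pairs."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for dim, score in results:
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim}")
        buckets[dim].append(score)
    return {dim: mean(buckets[dim]) for dim in DIMENSIONS if buckets[dim]}
```

Reporting a mean per dimension, rather than one pooled number, is what surfaces the kind of gaps the paper highlights (e.g., robustness lagging behind knowledge understanding).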
👥 Authors

Jiliang Hu
Tencent AI Lab, Beijing, China

Wenfu Wang
Tencent AI Lab, Beijing, China

Zuchao Li
Wuhan University
Natural Language Processing, Machine Learning

Chenxing Li
Tencent AI Lab, Beijing, China

Yiyang Zhao
Ingdan Labs
Internet of Things, Mobile Computing

Hanzhao Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University
Speech Synthesis, Spontaneous Speech, Speech Codec

Liqiang Zhang
Tencent AI Lab, Beijing, China

Meng Yu
Tencent AI Lab, Beijing, China

Dong Yu
Tencent AI Lab, Beijing, China