VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating audio-language models suffer from three key limitations: English-centric design, overreliance on synthetic speech, and the absence of multidimensional, fine-grained assessment. To address these gaps, the authors introduce VCB Bench, a high-quality benchmark built on real human speech and tailored to Chinese spoken dialogue evaluation. It systematically assesses models along three core dimensions: instruction following, knowledge comprehension, and system robustness. Methodologically, it introduces speech-level control mechanisms, realistic contextual perturbations (e.g., background noise, disfluencies), and cross-speaker stability testing. Unlike conventional single-dimension or synthetic-data benchmarks, VCB Bench enables reproducible, fine-grained performance analysis. Empirical evaluation of state-of-the-art large audio language models reveals substantial capability disparities, particularly in robustness and contextual grounding, highlighting critical bottlenecks. VCB Bench thus provides foundational infrastructure and practical guidelines for the standardized evaluation and targeted improvement of Chinese spoken dialogue systems.
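The robustness protocol summarized above (perturbing content, environment, and speaker while holding meaning fixed) can be made concrete as a simple consistency check. The Python sketch below is illustrative only: the `PerturbedItem` structure and the `run_model` callable are assumptions for this example, not part of the VCB Bench release.

```python
# Hypothetical sketch of a perturbation-consistency check: score how stable a
# model's answer stays when the same utterance is re-recorded with background
# noise, disfluencies, or a different speaker. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PerturbedItem:
    question_id: str
    clean_audio: str             # path to the clean human recording
    perturbed_audios: List[str]  # same content under noise / disfluency / new speaker
    reference: str               # gold answer, usable for separate correctness checks


def robustness_score(items: List[PerturbedItem],
                     run_model: Callable[[str], str]) -> float:
    """Fraction of perturbed variants whose answer matches the clean-audio answer."""
    consistent, total = 0, 0
    for item in items:
        clean_answer = run_model(item.clean_audio).strip().lower()
        for audio in item.perturbed_audios:
            total += 1
            if run_model(audio).strip().lower() == clean_answer:
                consistent += 1
    return consistent / total if total else 0.0
```

Exact string match is the crudest possible comparison; a real harness would likely use a judge model or task-specific scoring, but the clean-versus-perturbed pairing is the core of the idea.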

📝 Abstract
Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited -- they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) -- a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in existing audio-grounded conversational agent benchmarks
Evaluating multimodal models across instruction following and knowledge understanding
Providing a robust assessment framework for Chinese voice conversational systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark uses real human speech data
Evaluates models across three complementary perspectives
Provides a reproducible, fine-grained evaluation framework (see the sketch after this list)
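As referenced in the last bullet, the "reproducible, fine-grained" aspect can be illustrated by a minimal per-dimension aggregation. The dimension tags and the (dimension, score) input shape below are assumptions made for illustration; the benchmark's actual harness may differ.

```python
# Minimal sketch of a per-dimension report, assuming each evaluated example is
# pre-tagged with one of the three VCB Bench perspectives. Tag names and the
# (dimension, score) input shape are assumptions, not the benchmark's API.
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

DIMENSIONS = ("instruction_following", "knowledge_understanding", "robustness")


def per_dimension_report(results: List[Tuple[str, float]]) -> Dict[str, float]:
    """results: (dimension_tag, example_score in [0, 1]) pairs."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for dim, score in results:
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim}")
        buckets[dim].append(score)
    return {dim: mean(buckets[dim]) for dim in DIMENSIONS if buckets[dim]}
```

Reporting a mean per dimension, rather than one pooled number, is what surfaces the kind of gaps the paper highlights (e.g., robustness lagging behind knowledge understanding).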
👥 Authors

Jiliang Hu
Tencent AI Lab, Beijing, China

Wenfu Wang
Tencent AI Lab, Beijing, China

Zuchao Li
Wuhan University
Natural Language Processing, Machine Learning

Chenxing Li
Tencent AI Lab, Beijing, China

Yiyang Zhao
Ingdan Labs
Internet of Things, Mobile Computing

Hanzhao Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University
Speech Synthesis, Spontaneous Speech, Speech Codec

Liqiang Zhang
Tencent AI Lab, Beijing, China

Meng Yu
Tencent AI Lab, Beijing, China

Dong Yu
Tencent AI Lab, Beijing, China