KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

📅 2026-05-27
🤖 AI Summary
This study addresses the English-centric evaluation of speech language models (SpeechLMs) and the resulting lack of high-quality, linguistically appropriate benchmarks for other languages such as Korean. To bridge this gap, the work proposes two agent-based, human-in-the-loop frameworks. The first transfers existing spoken question-answering (SpokenQA) benchmarks from a source language into Korean, avoiding the distortions that a conventional ASR, translation, and TTS pipeline introduces into language-specific instructions, answer constraints, and spoken forms. The second builds an audio understanding benchmark directly from Korean ASR corpora and speaker metadata, so that speaker attributes, accents, and paralinguistic properties come from native Korean audio rather than from transferred source-language audio. The authors present the first comprehensive SpeechLM evaluation suite for Korean, releasing three benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) that comprise 12,345 samples in total. Evaluations of eight recent SpeechLMs show that English-Korean performance gaps vary substantially across models and task families, and that model rankings on SpokenQA and audio understanding diverge, underscoring the need for multilingual evaluation.
📝 Abstract
Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.
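As a rough illustration of the second framework's input and output, the sketch below turns one ASR-corpus record (transcription plus speaker metadata) into a multiple-choice audio understanding item. It is not the authors' pipeline: the record fields, the `ATTRIBUTE_LABELS` sets, the question template, and the `needs_human_review` flag are all hypothetical, and the agent and human verification steps described in the abstract are reduced to a single flag.

```python
import random
from dataclasses import dataclass


@dataclass
class AsrRecord:
    """One utterance from an ASR corpus (hypothetical schema)."""
    audio_path: str
    transcription: str
    speaker: dict  # speaker metadata, e.g. {"gender": "female"}


# Hypothetical closed label sets; a real corpus defines its own metadata values.
ATTRIBUTE_LABELS = {
    "gender": ["female", "male"],
    "region": ["Seoul", "Gyeongsang", "Jeolla", "Chungcheong"],
}


def make_item(record, attribute, rng):
    """Build a multiple-choice item about one speaker attribute.

    Returns None when the record lacks a usable label for the attribute,
    so that no answer is ever guessed from missing metadata.
    """
    answer = record.speaker.get(attribute)
    labels = ATTRIBUTE_LABELS[attribute]
    if answer not in labels:
        return None
    options = list(labels)
    rng.shuffle(options)  # shuffle so the answer position carries no signal
    return {
        "audio_path": record.audio_path,
        "question": f"What is the speaker's {attribute}?",
        "options": options,
        "answer_index": options.index(answer),
        "needs_human_review": True,  # stand-in for the human-in-the-loop check
    }


record = AsrRecord("utt_0001.wav", "안녕하세요", {"gender": "female"})
item = make_item(record, "gender", random.Random(0))
print(item["question"], item["options"], item["answer_index"])
```

Because the answer is read from corpus metadata rather than inferred from translated or synthesized audio, the item stays tied to the original Korean recording. Records with missing metadata are skipped instead of being assigned a default label.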
Problem

Research questions and friction points this paper is trying to address.

SpeechLM evaluation
multilingual speech benchmarks
Korean speech
SpokenQA
audio understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech Language Models
Multilingual Speech Benchmarking
Agent-Driven Benchmark Construction
Korean SpokenQA
Audio Understanding