MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

📅 2026-01-14
🤖 AI Summary
This study addresses the scarcity of audio corpora for Classical Chinese texts, which limits the application of multimodal large language models (MLLMs) to speech-related tasks. To bridge this gap, we present MCGA, the first multi-task audio corpus spanning diverse Classical Chinese literary genres. It supports six tasks: automatic speech recognition, speech-to-text translation, speech emotion captioning, spoken question answering, speech understanding, and speech reasoning. We further introduce a domain-specific evaluation metric for speech emotion captioning and a measure of the consistency between a model's speech and text capabilities. Using a systematic data collection and annotation pipeline, we benchmark ten MLLMs and find substantial performance deficiencies on Classical Chinese speech tasks. The corpus, along with code and evaluation protocols, is publicly released to support future research in this underexplored domain.

📝 Abstract
With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential in Chinese Classical Studies (CCS) has gained significant attention. While existing research focuses primarily on the text and visual modalities, audio corpora in this domain remain largely underexplored. To bridge this gap, we introduce the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA), a 119-hour corpus comprising 22,000 audio samples. It covers a diverse range of literary genres and supports six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Our evaluation of ten MLLMs shows that current models still face substantial challenges on the MCGA test set. Furthermore, we introduce a domain-specific metric for SEC and a metric that measures the consistency between speech and text capabilities. We release MCGA to the public to facilitate the development of more robust MLLMs. MCGA Corpus: https://github.com/yxduir/MCGA
Problem

Research questions and friction points this paper is trying to address.

Classical Chinese
Audio Corpus
Multimodal Large Language Models
Literary Genre
Speech Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task Audio Corpus
Classical Chinese
Multimodal Large Language Models
Speech Emotion Captioning
Audio-Text Consistency