TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the weak performance of existing large audio-language models (LALMs) on regional dialectal prosody, which the authors attribute to the scarcity of high-quality, region-specific audio-text paired data. To bridge this gap, they construct TW-Sound580K, a dataset of 580,000 Taiwanese audio-instruction pairs, using a Verify-Generate-Critique (VGC) curation protocol. The protocol uses Dual-ASR validation to filter raw clips and a teacher model to expand the verified clips into instruction pairs. Tai-LALM, fine-tuned on this dataset and equipped with dynamic Dual-ASR arbitration at inference time, achieves 49.1% accuracy on the TAU Benchmark, 6.5 percentage points above the zero-shot baseline (42.6%).

📝 Abstract
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, a 6.5-percentage-point absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This indicates that integrating regional corpora with rigorous curation and dynamic arbitration improves LALM performance on localized speech.
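The abstract does not spell out how Dual-ASR validation or arbitration is computed, so the following is only a minimal sketch of one plausible reading: a clip is kept when two ASR transcripts agree closely, and at inference the transcript with the higher confidence is selected. The character-level disagreement metric, the `max_disagreement` threshold of 0.2, the confidence scores, and all function names are assumptions made for illustration, not details from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement(hyp_a: str, hyp_b: str) -> float:
    """Edit distance normalized by the longer transcript, in [0, 1]."""
    longest = max(len(hyp_a), len(hyp_b))
    if longest == 0:
        return 0.0
    return edit_distance(hyp_a, hyp_b) / longest


def verify_clip(hyp_a: str, hyp_b: str, max_disagreement: float = 0.2) -> bool:
    """Hypothetical verification rule: keep a clip only if both ASR systems
    produced non-empty transcripts that agree within the assumed threshold."""
    if not hyp_a or not hyp_b:
        return False
    return disagreement(hyp_a, hyp_b) <= max_disagreement


def arbitrate(hyp_a: str, conf_a: float, hyp_b: str, conf_b: float) -> str:
    """Hypothetical arbitration rule: pick the transcript whose ASR system
    reports the higher confidence (ties go to the first system)."""
    return hyp_a if conf_a >= conf_b else hyp_b
```

Under these assumptions, `verify_clip` would play the role of the filtering step applied to raw clips, and `arbitrate` the role of transcription selection at inference; the actual criteria used for TW-Sound580K and Tai-LALM may differ.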
Problem

Research questions and friction points this paper is trying to address.

localized audio-language modeling
dialectal prosody
regional audio-text dataset
Large Audio-Language Models
speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verify-Generate-Critique
Dual-ASR validation
Regional audio-text dataset
Dynamic arbitration
Localized audio-language modeling