🤖 AI Summary
This study addresses the weak performance of existing large audio-language models (LALMs) on regional dialect prosody, which stems mainly from the scarcity of high-quality, region-specific audio-text data. To bridge this gap, the authors construct TW-Sound580K, a dataset of 580,000 Taiwanese audio-instruction pairs, using a Verify-Generate-Critique (VGC) data construction protocol. The pipeline combines Dual-ASR validation, which filters the raw clips, with teacher-model-guided generation, which expands them into instruction pairs. Tai-LALM, fine-tuned on this dataset and equipped with a dynamic Dual-ASR arbitration strategy for selecting transcriptions at inference, achieves 49.1% accuracy on the TAU benchmark, 6.5 percentage points above the zero-shot baseline (42.6%), indicating improved modeling and understanding of regionally accented speech.
📝 Abstract
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline uses Dual-ASR validation to filter 522K raw clips, then expands them into 580K high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, an absolute improvement of 6.5 percentage points over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.
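One way to picture the Dual-ASR validation step is as an agreement filter: each clip is transcribed by two ASR systems, and the clip is kept only if the two transcripts are close enough. The abstract does not say how agreement is measured, so the sketch below is an illustration under stated assumptions rather than the authors' method: the character-level normalized edit distance, the `Clip` structure, the `verify` function, and the `max_disagreement` threshold of 0.2 are all hypothetical.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement(hyp_a: str, hyp_b: str) -> float:
    """Normalized edit distance in [0, 1]; 0.0 means identical transcripts."""
    longest = max(len(hyp_a), len(hyp_b))
    if longest == 0:
        return 0.0
    return edit_distance(hyp_a, hyp_b) / longest


@dataclass
class Clip:
    clip_id: str
    hyp_a: str  # transcript from the first ASR system (hypothetical field)
    hyp_b: str  # transcript from the second ASR system (hypothetical field)


def verify(clips, max_disagreement=0.2):
    """Keep clips whose two transcripts agree closely.

    The 0.2 threshold is an assumed value, not one reported in the paper.
    """
    return [c for c in clips if disagreement(c.hyp_a, c.hyp_b) <= max_disagreement]
```

Character-level distance is used here because Mandarin and Taiwanese transcripts have no whitespace word boundaries; a real pipeline might instead compare tokenized or romanized output, or use model confidence scores.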