TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

📅 2026-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the weak performance of existing large audio-language models (LALMs) on regional dialect prosody, which the authors attribute to the scarcity of high-quality region-specific audio-text data. To close this gap, they construct TW-Sound580K, a dataset of 580,000 Taiwanese audio-instruction pairs, using a Verify-Generate-Critique (VGC) curation protocol. The pipeline applies Dual-ASR validation to filter 522K raw clips and then uses a teacher model to expand them into instruction pairs. The resulting model, Tai-LALM, fine-tunes a DeSTA 2.5-Audio-initialized backbone and adds a dynamic Dual-ASR Arbitration strategy to select transcriptions at inference time. It reaches 49.1% accuracy on the TAU Benchmark, 6.5 percentage points above the zero-shot baseline (42.6% with ASR text conditioning).

📝 Abstract

Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, an absolute improvement of 6.5 percentage points over the zero-shot baseline (42.6% with ASR text conditioning). These results indicate that integrating regional corpora with rigorous curation and dynamic arbitration enhances LALM performance on localized speech.
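The abstract does not specify how Dual-ASR validation or Dual-ASR Arbitration are implemented, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each clip carries transcripts from two ASR systems, that agreement is measured by character-level similarity, and that arbitration picks the hypothesis with the higher confidence score. The field names (`asr_a`, `asr_b`), the 0.8 threshold, and the confidence-based rule are all hypothetical and not taken from the paper.

```python
from difflib import SequenceMatcher


def agreement(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two transcripts.

    Assumption: an empty transcript from either system counts as no agreement.
    """
    a, b = a.strip(), b.strip()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def filter_clips(clips, threshold=0.8):
    """Keep clips whose two ASR transcripts agree at or above the threshold.

    `clips` is a list of dicts with hypothetical keys "asr_a" and "asr_b".
    """
    return [c for c in clips if agreement(c["asr_a"], c["asr_b"]) >= threshold]


def arbitrate(hyp_a, hyp_b):
    """Pick one (text, confidence) hypothesis; ties go to the first system.

    Assumption: each ASR system exposes a comparable confidence score.
    """
    return hyp_a if hyp_a[1] >= hyp_b[1] else hyp_b


clips = [
    {"id": 1, "asr_a": "今天天氣很好", "asr_b": "今天天氣很好"},
    {"id": 2, "asr_a": "今天天氣很好", "asr_b": "我要去台北車站"},
]
kept = filter_clips(clips)  # only clip 1 has matching transcripts
```

A character-level measure is used here because Mandarin and Taiwanese text is not whitespace-delimited; a real pipeline would more likely use a normalized character error rate and a threshold tuned on held-out data.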

Problem

Research questions and friction points this paper is trying to address.

localized audio-language modeling

dialectal prosody

regional audio-text dataset

Large Audio-Language Models

speech recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Verify-Generate-Critique

Dual-ASR validation

Regional audio-text dataset

Dynamic arbitration

Localized audio-language modeling
