Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

📅 2025-09-18
🤖 AI Summary
Low-resource languages such as Thai pose several challenges for speech large language models (SLLMs): commonly used speech encoders perform poorly, multimodal understanding is weak, ASR-based alignment is computationally expensive, and paired speech-text data is scarce. To address these, this work proposes three components: (1) XLSR-Thai, the first self-supervised speech encoder for Thai; (2) U-Align, a speech-text alignment method that is more resource-efficient than ASR-based alignment; and (3) Thai-SUP, a pipeline that synthesizes Thai spoken language understanding data from high-resource languages, yielding over 1,000 hours of multitask training data. By combining self-supervised pretraining, U-Align alignment, and data transferred from high-resource languages, the approach improves Thai speech recognition, semantic understanding, and instruction following. XLSR-Thai and Thai-SUP are publicly released to support research on low-resource speech understanding.

📝 Abstract
Speech large language models (SLLMs) built on speech encoders, adapters, and LLMs demonstrate remarkable multitask understanding performance in high-resource languages such as English and Chinese. However, their effectiveness degrades substantially in low-resource languages such as Thai. This limitation arises from three factors: (1) commonly used speech encoders, such as the Whisper family, underperform in low-resource languages and lack support for broader spoken language understanding tasks; (2) the ASR-based alignment paradigm requires training the entire SLLM, leading to high computational cost; (3) paired speech-text data in low-resource languages is scarce. To overcome these challenges for Thai, we introduce XLSR-Thai, the first self-supervised learning (SSL) speech encoder for Thai. It is obtained by continuing to train the standard SSL XLSR model on 36,000 hours of Thai speech data. Furthermore, we propose U-Align, a speech-text alignment method that is more resource-efficient and more effective across multiple tasks than typical ASR-based alignment. Finally, we present Thai-SUP, a pipeline for generating Thai spoken language understanding data from high-resource languages, yielding the first Thai spoken language understanding dataset of over 1,000 hours. Multiple experiments demonstrate the effectiveness of our methods in building a Thai multitask-understanding SLLM. We open-source XLSR-Thai and Thai-SUP to facilitate future research.
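The encoder, adapter, and LLM composition mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's adapter and not U-Align, whose details the abstract does not give. It assumes a common adapter design (stack k consecutive encoder frames, then apply a linear projection into the LLM embedding width), and every name and size below (`stack_frames`, `linear`, `adapter`, `ENC_DIM`, `LLM_DIM`, `K`) is hypothetical. Random numbers stand in for real encoder outputs and prompt embeddings.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames, dropping a ragged tail."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(x, weight):
    """Multiply one vector by a (d_in x d_out) weight matrix."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*weight)]


def adapter(frames, weight, k):
    """Downsample by frame stacking, then project to the LLM embedding width."""
    return [linear(f, weight) for f in stack_frames(frames, k)]


# Toy sizes, chosen only for illustration (assumptions, not from the paper).
ENC_DIM, LLM_DIM, K = 4, 6, 2
rng = random.Random(0)

# Stand-in for speech encoder output: 7 frames of width ENC_DIM.
encoder_out = [[rng.uniform(-1, 1) for _ in range(ENC_DIM)] for _ in range(7)]

# Adapter projection: (ENC_DIM * K) inputs to LLM_DIM outputs.
weight = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(ENC_DIM * K)]

speech_embeds = adapter(encoder_out, weight, K)

# Stand-in for an embedded text prompt of 3 tokens.
prompt_embeds = [[0.0] * LLM_DIM for _ in range(3)]

# The LLM would consume speech embeddings followed by prompt embeddings.
llm_input = speech_embeds + prompt_embeds
print(len(speech_embeds), len(speech_embeds[0]), len(llm_input))
```

With seven input frames and K = 2, the last frame is dropped and three speech embeddings reach the LLM. Shortening the sequence this way is one reason adapters are used, since speech encoders typically emit many more frames than a text prompt has tokens.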
Problem

Research questions and friction points this paper is trying to address.

Addressing speech model performance degradation in low-resource languages like Thai
Overcoming data scarcity and high computational costs in speech-text alignment
Developing effective multitask speech understanding without extensive labeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed XLSR-Thai SSL speech encoder for Thai
Proposed U-Align efficient speech-text alignment method
Created Thai-SUP pipeline generating SLU dataset
Mingchen Shao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Bingshen Mu
Northwestern Polytechnical University
Speech Recognition, Speech Understanding
Chengyou Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Hai Li
iQIYI, Inc., China
Ying Yan
Microsoft Research
Big Data Management
Zhonghua Fu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China