AI Summary
This study addresses the scarcity of high-quality resources and unified modeling frameworks for Singaporean English (Singlish) spoken language understanding. We introduce MNSC, the first large-scale, standardized, multi-task speech corpus for Singlish, covering automatic speech recognition (ASR), spoken question answering, spoken dialogue summarization, and paralinguistic understanding. We further propose SingAudioLLM, the first end-to-end multi-task multimodal model for Singlish, which combines speech-text alignment pretraining and multi-task prompt learning, and is trained and evaluated on human-verified, standardized data splits. Compared with conventional cascaded systems and existing AudioLLMs, SingAudioLLM achieves 10-30% performance gains across multiple Singlish spoken language understanding tasks, establishing new state-of-the-art results. Our core contributions are: (1) the first open-source, high-quality Singlish speech benchmark; (2) the first multi-task joint audio-language modeling paradigm for Singlish; and (3) systematic empirical validation of multimodal large language models' efficacy in low-resource dialectal spoken language understanding.
Abstract
Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model that leverages multimodal large language models to handle these tasks concurrently. Experiments demonstrate our model's adaptability to the Singlish context, achieving state-of-the-art performance and outperforming other AudioLLMs and cascaded solutions by 10-30%.
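To make the multi-task setup concrete, the sketch below shows one way examples from the four MNSC tasks (ASR, SQA, SDS, PQA) could be wrapped into a single prompt-based interface for an audio-language model. It is a minimal illustration of the multi-task prompt-learning idea only: the schema (`MNSCExample`), the prompt templates, and the file path are assumptions, not the released dataset API or the exact instructions used by SingAudioLLM.

```python
from dataclasses import dataclass

# Hypothetical task instructions; the actual prompts used by SingAudioLLM may differ.
TASK_PROMPTS = {
    "ASR": "Transcribe the audio exactly as spoken.",
    "SQA": "Answer the question based on the audio: {question}",
    "SDS": "Summarize the spoken dialogue in the audio.",
    "PQA": "Answer this question about the speaker (e.g. gender, accent): {question}",
}

@dataclass
class MNSCExample:
    """One item in an assumed unified multi-task format."""
    audio_path: str     # path to the Singlish speech clip
    task: str           # one of ASR / SQA / SDS / PQA
    question: str = ""  # used only for SQA / PQA
    target: str = ""    # reference transcript, answer, or summary

def build_prompt(example: MNSCExample) -> str:
    """Render the task-specific text instruction paired with the audio input."""
    return TASK_PROMPTS[example.task].format(question=example.question)

if __name__ == "__main__":
    ex = MNSCExample(
        audio_path="clips/example_0001.wav",  # illustrative filename
        task="SQA",
        question="What did the speaker order at the hawker centre?",
    )
    print(build_prompt(ex))
```

Framing all four tasks as (audio, instruction) to text pairs like this is what allows a single model to be trained on them concurrently, in contrast to cascaded pipelines that chain a separate ASR system with downstream text models.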