SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

📅 2025-11-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the lack of a unified large audio-language model (LALM) for multilingual audio understanding and speech interaction in Southeast Asia, this work introduces SeaLLMs-Audio, the first LALM tailored to the region, covering Indonesian, Thai, Vietnamese, English, and Chinese. Methodologically, the work adapts large language models to Southeast Asian audio scenarios via an end-to-end architecture that jointly trains a speech encoder with a multilingual language model, supporting audio-only, text-only, and combined audio-text inputs. Key contributions include: (1) training on a large-scale audio corpus covering the five languages; (2) introducing SeaBench-Audio, a multi-task benchmark for automating LALM evaluation in Southeast Asia; and (3) achieving competitive performance against other LALMs on Southeast Asian languages, across tasks such as speech recognition, emotion recognition, and spoken question answering. The model is intended to benefit both the regional research community and industry.

📝 Abstract

We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages, specifically Indonesian (id), Thai (th), and Vietnamese (vi), alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

Problem

Research questions and friction points this paper is trying to address.

Develops the first large audio-language model tailored for multiple Southeast Asian languages

Enables multimodal audio-text understanding and voice-based interactions

Supports diverse audio tasks from speech recognition to dialogue systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large audio-language model tailored for Southeast Asian languages

Multimodal input supporting audio, text, and combined modalities

Multi-task capabilities spanning speech recognition and voice dialogue

🔎 Similar Papers

Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

2024-09-17 · arXiv.org · Citations: 5