Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges of modeling Tibetan speech, which suffers from severe data scarcity and pronounced phonetic disparities among its three major dialects (Ü-Tsang, Amdo, and Kham). To tackle these issues, we propose Ti-Audio, the first end-to-end multi-dialectal speech large language model designed specifically for Tibetan. Our approach uses a dynamic Q-Former adapter to align variable-length speech with text across modalities. To mitigate low-resource constraints, it also introduces cross-dialect collaborative modeling, in which related dialects augment one another's data, together with a temperature-controlled sampling strategy. The proposed method achieves state-of-the-art performance on Tibetan automatic speech recognition and speech translation benchmarks, offering a scalable paradigm for building speech foundation models for resource-scarce languages.

📝 Abstract

Recent advances in Speech Large Language Models (Speech-LLMs) have greatly enhanced multimodal interaction capabilities. However, applying them in low-resource, dialect-diverse settings remains challenging. Tibetan is a prime example: its data are severely scarce, and its major dialects (Ü-Tsang, Amdo, and Kham) differ phonetically. This paper proposes Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan. To efficiently align speech and text, we introduce a Dynamic Q-Former Adapter that extracts essential acoustic features from variable-length speech, ensuring stable cross-modal alignment even with limited data. At the data level, we leverage mutual assistance among related dialects to alleviate data scarcity and employ a temperature-based sampling strategy to maximize this synergy. Experimental results demonstrate that Ti-Audio achieves state-of-the-art performance on Tibetan benchmarks for automatic speech recognition and speech translation. Our work validates the effectiveness of cross-dialectal cooperation and provides a scalable paradigm for developing Speech-LLMs in low-resource scenarios.
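
The abstract does not spell out how the temperature-based sampling is defined. As a minimal sketch, assuming the form commonly used in multilingual training, where dialect i is sampled with probability proportional to (n_i / N)^(1/T), the effect can be illustrated as follows. The function name and the per-dialect hour counts are hypothetical and are not taken from the paper.

```python
def temperature_sampling_probs(sizes, temperature=1.0):
    """Return per-dialect sampling probabilities p_i ∝ (n_i / N) ** (1 / T).

    Assumed form, not confirmed by the paper. T = 1 reproduces the natural
    data proportions; larger T flattens the distribution so that smaller
    dialects are drawn more often.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    # Scale each dialect's share of the data by the exponent 1 / T.
    weights = {name: (n / total) ** (1.0 / temperature) for name, n in sizes.items()}
    # Renormalize so the probabilities sum to 1.
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


# Hypothetical amounts of training audio (hours) per dialect.
hours = {"U-Tsang": 800, "Amdo": 150, "Kham": 50}

if __name__ == "__main__":
    for t in (1.0, 2.0, 5.0):
        probs = temperature_sampling_probs(hours, temperature=t)
        print(t, {name: round(p, 3) for name, p in probs.items()})
```

Under this assumed form, raising T shifts sampling mass from the best-resourced dialect toward the scarcer ones, which is one way a training mix could balance dialects without discarding data.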

Problem

Research questions and friction points this paper is trying to address.

low-resource

dialect-diverse

Tibetan

Speech-LLM

data scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-LLM

multi-dialectal

Dynamic Q-Former Adapter

low-resource

cross-dialectal cooperation
