🤖 AI Summary
Current speech large language models (SLLMs) show significantly weaker instruction-following for non-core languages (e.g., Japanese, German, French) than for core languages such as English, largely due to scarce high-quality speech-text paired data and limited cross-lingual semantic reasoning ability. To address this, we propose XS-CoT (semi-implicit Cross-lingual Speech Chain-of-Thought), a framework that embeds speech-to-text translation within the reasoning chain, transferring the strong reasoning capacity of a core language to non-core languages. XS-CoT generates four types of tokens (instruction and response tokens in both the core and non-core language) and applies a semi-implicit scheme that progressively compresses the intermediate reasoning tokens during training. On two representative SLLMs, Qwen2-Audio and SALMONN, XS-CoT improves GPT-4 scores for non-core languages by up to 45% over direct supervised fine-tuning and cuts token delay by more than 50% with only a slight drop in GPT-4 scores, while requiring only a small amount of high-quality non-core speech data. We also develop a data pipeline and open-source speech instruction-following datasets in Japanese, German, and French.
📝 Abstract
Large language models have been extended to the speech domain, leading to the development of speech large language models (SLLMs). While existing SLLMs demonstrate strong performance in speech instruction-following for core languages (e.g., English), they often struggle with non-core languages due to the scarcity of paired speech-text data and limited multilingual semantic reasoning capabilities. To address this, we propose the semi-implicit Cross-lingual Speech Chain-of-Thought (XS-CoT) framework, which integrates speech-to-text translation into the reasoning process of SLLMs. XS-CoT generates four types of tokens, instruction and response tokens in both core and non-core languages, enabling cross-lingual transfer of reasoning capabilities. To mitigate inference latency in generating target non-core response tokens, we incorporate a semi-implicit CoT scheme into XS-CoT, which progressively compresses the first three types of intermediate reasoning tokens while retaining global reasoning logic during training. By leveraging the robust reasoning capabilities of the core language, XS-CoT improves responses for non-core languages by up to 45% in GPT-4 score compared to direct supervised fine-tuning on two representative SLLMs, Qwen2-Audio and SALMONN. Moreover, the semi-implicit XS-CoT reduces token delay by more than 50% with a slight drop in GPT-4 scores. Importantly, XS-CoT requires only a small amount of high-quality training data for non-core languages by leveraging the reasoning capabilities of core languages. To support training, we also develop a data pipeline and open-source speech instruction-following datasets in Japanese, German, and French.
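The four-token-type reasoning order and the semi-implicit compression described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the segment names, the truncation-based `compress` function, and the `keep_ratio` schedule are all assumptions standing in for whatever tokenization and compression scheme XS-CoT actually uses.

```python
# Hypothetical sketch of the XS-CoT target sequence:
# non-core instruction -> core instruction (translation)
# -> core response (reasoning) -> non-core response (final answer).
# Segment contents and the compression rule are illustrative assumptions.

def build_xs_cot_target(non_core_instr, core_instr, core_resp, non_core_resp):
    """Concatenate the four token types in reasoning order."""
    return non_core_instr + core_instr + core_resp + non_core_resp

def compress(tokens, keep_ratio):
    """Stand-in for intermediate-token compression: keep roughly the
    first keep_ratio fraction of a segment (none when keep_ratio is 0)."""
    if keep_ratio <= 0:
        return []
    return tokens[:max(1, int(len(tokens) * keep_ratio))]

def semi_implicit_target(segments, keep_ratio):
    """Progressively shrink the first three (intermediate) segments during
    training while always emitting the full non-core response."""
    *intermediate, final = segments
    compressed = [compress(seg, keep_ratio) for seg in intermediate]
    return [tok for seg in compressed for tok in seg] + final

# At keep_ratio == 0 only the non-core response remains, which is how a
# scheme like this would reduce token delay at inference time.
```

Under this sketch, training would anneal `keep_ratio` from 1 toward 0, so the model first learns the full cross-lingual chain and later produces the non-core answer with few or no intermediate tokens.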