🤖 AI Summary
Current speech large language models (SLLMs) show significantly weaker instruction-following for non-core languages (e.g., Japanese, German, French) than for core languages such as English, largely due to scarce high-quality speech-text paired data and limited cross-lingual semantic reasoning ability. To address this, we propose XS-CoT (semi-implicit Cross-lingual Speech Chain-of-Thought), a framework that embeds speech-to-text translation within the reasoning chain, transferring the strong reasoning capacity of a core language to non-core languages. XS-CoT generates four types of tokens (instruction and response tokens in both the core and non-core language) and applies a semi-implicit scheme that progressively compresses the intermediate reasoning tokens during training. On two representative SLLMs, Qwen2-Audio and SALMONN, XS-CoT improves GPT-4 scores for non-core languages by up to 45% over direct supervised fine-tuning and cuts token delay by more than 50% with only a slight drop in GPT-4 scores, while requiring only a small amount of high-quality non-core speech data. We also develop a data pipeline and open-source speech instruction-following datasets in Japanese, German, and French.
📝 Abstract
Large language models have been extended to the speech domain, leading to the development of speech large language models (SLLMs). While existing SLLMs demonstrate strong performance in speech instruction-following for core languages (e.g., English), they often struggle with non-core languages due to the scarcity of paired speech-text data and limited multilingual semantic reasoning capabilities. To address this, we propose the semi-implicit Cross-lingual Speech Chain-of-Thought (XS-CoT) framework, which integrates speech-to-text translation into the reasoning process of SLLMs. XS-CoT generates four types of tokens, instruction and response tokens in both core and non-core languages, enabling cross-lingual transfer of reasoning capabilities. To mitigate inference latency in generating target non-core response tokens, we incorporate a semi-implicit CoT scheme into XS-CoT, which progressively compresses the first three types of intermediate reasoning tokens while retaining global reasoning logic during training. By leveraging the robust reasoning capabilities of the core language, XS-CoT improves responses for non-core languages by up to 45% in GPT-4 score compared to direct supervised fine-tuning on two representative SLLMs, Qwen2-Audio and SALMONN. Moreover, the semi-implicit XS-CoT reduces token delay by more than 50% with a slight drop in GPT-4 scores. Importantly, XS-CoT requires only a small amount of high-quality training data for non-core languages by leveraging the reasoning capabilities of core languages. To support training, we also develop a data pipeline and open-source speech instruction-following datasets in Japanese, German, and French.
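The four-token-type reasoning order and the semi-implicit compression described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the segment names, the truncation-based `compress` function, and the `keep_ratio` schedule are all assumptions standing in for whatever tokenization and compression scheme XS-CoT actually uses.

```python
# Hypothetical sketch of the XS-CoT target sequence:
# non-core instruction -> core instruction (translation)
# -> core response (reasoning) -> non-core response (final answer).
# Segment contents and the compression rule are illustrative assumptions.

def build_xs_cot_target(non_core_instr, core_instr, core_resp, non_core_resp):
    """Concatenate the four token types in reasoning order."""
    return non_core_instr + core_instr + core_resp + non_core_resp

def compress(tokens, keep_ratio):
    """Stand-in for intermediate-token compression: keep roughly the
    first keep_ratio fraction of a segment (none when keep_ratio is 0)."""
    if keep_ratio <= 0:
        return []
    return tokens[:max(1, int(len(tokens) * keep_ratio))]

def semi_implicit_target(segments, keep_ratio):
    """Progressively shrink the first three (intermediate) segments during
    training while always emitting the full non-core response."""
    *intermediate, final = segments
    compressed = [compress(seg, keep_ratio) for seg in intermediate]
    return [tok for seg in compressed for tok in seg] + final

# At keep_ratio == 0 only the non-core response remains, which is how a
# scheme like this would reduce token delay at inference time.
```

Under this sketch, training would anneal `keep_ratio` from 1 toward 0, so the model first learns the full cross-lingual chain and later produces the non-core answer with few or no intermediate tokens.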