🤖 AI Summary
Existing speech large language models (Speech LLMs) largely neglect spatial audio modeling, limiting their ability to support directional speech recognition, sound source localization, and bystander crosstalk suppression—capabilities critical for multi-microphone wearable devices such as smart glasses. To address this gap, we propose the first spatially aware Speech LLM framework built upon the Llama architecture. Our approach integrates multi-channel beamformed acoustic features, direction-aware positional encoding, contrastive direction data augmentation (CDDA), and serialized directional output training (S-DOT). By jointly optimizing spatial direction perception and automatic speech recognition in an end-to-end manner, our model achieves significant improvements: a 4.2% absolute reduction in word error rate (WER) and a 12.6° decrease in mean angular error (MAE) for sound source localization, while effectively suppressing off-axis crosstalk. This work pioneers fine-grained spatial direction modeling deeply embedded within Speech LLMs, establishing a new paradigm for spatially intelligent voice interaction on wearables.
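The serialized directional output idea can be illustrated with a minimal sketch: localization and recognition share one autoregressive target by emitting a direction token before each transcript. The token format, the 10° quantization, and the function name below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a serialized directional output (S-DOT) target.
# The model is trained to emit a quantized direction token before each
# speaker's transcript, so localization and recognition share one sequence.
def make_sdot_target(segments, angle_bin_deg=10):
    """Serialize (azimuth, transcript) pairs into one training string.

    segments: list of (azimuth_degrees, transcript) tuples.
    Each azimuth is quantized into angle_bin_deg-wide bins and rendered
    as a direction token, e.g. <dir_90> for azimuths in [90, 100).
    """
    parts = []
    for azimuth, text in segments:
        bin_start = int(azimuth // angle_bin_deg) * angle_bin_deg
        parts.append(f"<dir_{bin_start}> {text}")
    return " ".join(parts)
```

In this sketch, a two-speaker clip with speech at 95° and 182° would serialize as `<dir_90> hello <dir_180> world`, letting a single cross-entropy loss supervise both tasks.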
📝 Abstract
Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech recognition capabilities. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains relatively unexplored. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression. To enhance the model's ability to understand directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA). Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance in both speech recognition and source localization tasks.
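The contrastive direction data augmentation idea can be sketched as follows: the same utterances are re-simulated with the speakers' directions swapped, so only the spatial cue, never the lexical content, distinguishes target speech from bystander speech. The data layout and function name are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of contrastive direction data augmentation (CDDA).
# Given a two-speaker example, build a counterpart with the directions
# swapped; each transcript stays attached to its utterance, so the model
# cannot use content alone to decide which direction to transcribe.
def cdda_pair(example):
    """example: [(azimuth_deg, transcript), (azimuth_deg, transcript)]
    Returns the original example and its direction-swapped counterpart."""
    (az1, txt1), (az2, txt2) = example
    original = [(az1, txt1), (az2, txt2)]
    swapped = [(az2, txt1), (az1, txt2)]
    return original, swapped
```

Training on both members of each pair encourages the model to ground its direction predictions in the multi-channel acoustic evidence rather than in correlations between content and position.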