🤖 AI Summary
Existing speech large language models (Speech LLMs) largely neglect spatial audio modeling, limiting their ability to support directional speech recognition, sound source localization, and bystander crosstalk suppression—capabilities critical for multi-microphone wearable devices such as smart glasses. To address this gap, we propose the first spatially aware Speech LLM framework built upon the Llama architecture. Our approach integrates multi-channel beamformed acoustic features, direction-aware positional encoding, contrastive direction data augmentation (CDDA), and serialized directional output training (S-DOT). By jointly optimizing spatial direction perception and automatic speech recognition in an end-to-end manner, our model achieves significant improvements: a 4.2% absolute reduction in word error rate (WER) and a 12.6° decrease in mean angular error (MAE) for sound source localization, while effectively suppressing off-axis crosstalk. This work pioneers fine-grained spatial direction modeling deeply embedded within Speech LLMs, establishing a new paradigm for spatially intelligent voice interaction on wearables.
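The serialized directional output idea can be illustrated with a minimal sketch: localization and recognition share one autoregressive target by emitting a direction token before each transcript. The token format, the 10° quantization, and the function name below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a serialized directional output (S-DOT) target.
# The model is trained to emit a quantized direction token before each
# speaker's transcript, so localization and recognition share one sequence.
def make_sdot_target(segments, angle_bin_deg=10):
    """Serialize (azimuth, transcript) pairs into one training string.

    segments: list of (azimuth_degrees, transcript) tuples.
    Each azimuth is quantized into angle_bin_deg-wide bins and rendered
    as a direction token, e.g. <dir_90> for azimuths in [90, 100).
    """
    parts = []
    for azimuth, text in segments:
        bin_start = int(azimuth // angle_bin_deg) * angle_bin_deg
        parts.append(f"<dir_{bin_start}> {text}")
    return " ".join(parts)
```

In this sketch, a two-speaker clip with speech at 95° and 182° would serialize as `<dir_90> hello <dir_180> world`, letting a single cross-entropy loss supervise both tasks.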
📝 Abstract
Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech recognition capabilities. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains relatively unexplored. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression. To enhance the model's ability to understand directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA). Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance in both speech recognition and source localization tasks.
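The contrastive direction data augmentation idea can be sketched as follows: the same utterances are re-simulated with the speakers' directions swapped, so only the spatial cue, never the lexical content, distinguishes target speech from bystander speech. The data layout and function name are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of contrastive direction data augmentation (CDDA).
# Given a two-speaker example, build a counterpart with the directions
# swapped; each transcript stays attached to its utterance, so the model
# cannot use content alone to decide which direction to transcribe.
def cdda_pair(example):
    """example: [(azimuth_deg, transcript), (azimuth_deg, transcript)]
    Returns the original example and its direction-swapped counterpart."""
    (az1, txt1), (az2, txt2) = example
    original = [(az1, txt1), (az2, txt2)]
    swapped = [(az2, txt1), (az1, txt2)]
    return original, swapped
```

Training on both members of each pair encourages the model to ground its direction predictions in the multi-channel acoustic evidence rather than in correlations between content and position.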