Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

This work addresses the limitation of current large language models (LLMs), which are predominantly trained on single-channel, single-speaker speech and thus struggle with directional speech understanding involving multiple speakers and channels—particularly in streaming scenarios such as smart glasses. To bridge this gap, the study introduces spatial directionality into LLM-based speech understanding for the first time, proposing two novel architectures: a cascaded approach integrating a multi-microphone array front-end for speech separation, and an end-to-end training paradigm that directly accepts serialized multi-channel inputs. By jointly leveraging spatial audio encoding, source separation, and LLM fine-tuning, the system achieves substantial performance gains in streaming speech recognition and translation, effectively enabling practical, directional multi-speaker speech comprehension with large language models.

Technology Category

Application Category

📝 Abstract

Recent studies have demonstrated that prompting large language models (LLM) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to directly apply them to multi-talker and multi-channel speech understanding task. In this work, we present a comprehensive investigation on how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in smart glasses usecase. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. All of the approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.

Problem

Research questions and friction points this paper is trying to address.

directional speech understanding

multi-talker speech

large language models

multi-channel audio

smart glasses

Innovation

Methods, ideas, or system contributions that make the work stand out.

directional speech understanding

multi-talker speech

large language models