Spatial Audio Processing with Large Language Model on Wearable Devices

📅 2025-04-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of integrating spatial contextual information with large language models (LLMs) to enable spatial speech understanding and adaptive interaction on wearable devices. We propose a novel architecture that combines microstructure-assisted single-microphone direction-of-arrival (DoA) estimation with LLM-based cross-modal alignment, and we introduce OmniTalk, the first synthetic spatial speech dataset of its kind. Our lightweight on-device spatial-language joint inference framework integrates Whisper encoding, a LoRA-finetuned LLaMA-3.2 3B model, and multimodal embedding fusion. Experiments demonstrate a spatial ASR mean directional error of only 25.72° (roughly a 71% improvement over prior work) with a word error rate (WER) of 5.3; acoustic scene analysis enables robust localization of up to 5 concurrent speakers with a median DoA error of 16°; and the full system runs efficiently on low-power edge devices. The core contribution is a high-accuracy single-channel DoA estimation method coupled with an LLM spatial-semantic alignment paradigm.
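The summary above describes a pipeline in which Whisper speech embeddings are fused with a DoA estimate before being passed into the LLM's input space. The sketch below illustrates one plausible form such a fusion module could take; the module name, feature dimensions, and sin/cos angle encoding are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of spatial-language embedding fusion:
# Whisper encoder features are concatenated with a learned DoA embedding and
# projected into the LLM embedding space. Dimensions are assumed values.
import torch
import torch.nn as nn

class SpatialSpeechFusion(nn.Module):
    def __init__(self, whisper_dim=1280, doa_dim=64, llm_dim=3072):
        super().__init__()
        # Embed the scalar DoA angle (encoded as sin/cos) into a small vector.
        self.doa_embed = nn.Sequential(nn.Linear(2, doa_dim), nn.GELU())
        # Project concatenated speech + spatial features into the LLM input space.
        self.proj = nn.Linear(whisper_dim + doa_dim, llm_dim)

    def forward(self, whisper_feats, doa_angle):
        # whisper_feats: (batch, frames, whisper_dim) from the Whisper encoder
        # doa_angle:     (batch,) estimated angle in radians
        doa_vec = torch.stack([torch.sin(doa_angle), torch.cos(doa_angle)], dim=-1)
        doa_emb = self.doa_embed(doa_vec).unsqueeze(1)            # (batch, 1, doa_dim)
        doa_emb = doa_emb.expand(-1, whisper_feats.size(1), -1)   # broadcast over frames
        fused = torch.cat([whisper_feats, doa_emb], dim=-1)
        return self.proj(fused)                                   # (batch, frames, llm_dim)
```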

📝 Abstract
Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of an existing dataset of microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk from the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI's Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially aware automatic speech recognition (ASR), achieving a mean error of $25.72^\circ$ (a substantial improvement over the $88.52^\circ$ median error in existing work) with a word error rate (WER) of 5.3. SING also supports soundscaping, for example inferring how many people are talking and their directions, with up to 5 people and a median DoA error of $16^\circ$. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
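For context on the reported numbers, the snippet below shows one plausible way the angular metrics could be computed: wrap the difference between predicted and ground-truth angles to the shorter arc, then take the mean (spatial ASR) or median (soundscaping). The function name and example values are illustrative assumptions, not the paper's evaluation code.

```python
# Hedged sketch of DoA error aggregation over a test set.
import numpy as np

def angular_error_deg(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = np.abs(np.asarray(pred_deg) - np.asarray(true_deg)) % 360.0
    return np.minimum(diff, 360.0 - diff)

# Example with made-up predictions and ground truth:
pred = np.array([12.0, 95.0, 181.0, 270.0])
true = np.array([10.0, 120.0, 175.0, 300.0])
errors = angular_error_deg(pred, true)
print("mean DoA error  :", errors.mean(), "deg")      # metric reported for spatial ASR
print("median DoA error:", np.median(errors), "deg")  # metric reported for soundscaping
```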
Problem

Research questions and friction points this paper is trying to address.

Enabling spatial speech understanding in wearable devices using LLMs
Addressing the lack of a dataset for microstructure-assisted speech recordings
Optimizing on-device processing with lightweight adaptation techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Microstructure-based monaural DoA sensing
Synthetic dataset OmniTalk for spatial speech
LoRA-optimized LLaMA-3.2 with Whisper fusion
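As a concrete illustration of the "LoRA-optimized LLaMA-3.2" contribution above, here is a minimal sketch of attaching LoRA adapters to a LLaMA-3.2 3B backbone with the Hugging Face PEFT library. The rank, alpha, dropout, and target modules are assumptions rather than the paper's reported settings.

```python
# Hedged sketch: lightweight adaptation of LLaMA-3.2 3B with LoRA via PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_cfg = LoraConfig(
    r=8,                                  # low-rank adapter dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only, to stay lightweight
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are updated during fine-tuning
```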
Authors

Ayushi Mishra
Department of Computer Science, University of Maryland, College Park, USA
Yang Bai
Department of Computer Science, University of Maryland, College Park, USA
Priyadarshani Narayanasamy
Department of Computer Science, University of Maryland, College Park, USA
Nakul Garg
Department of Computer Science, University of Maryland, College Park, USA
Nirupam Roy
Assistant Professor, University of Maryland, College Park
Research interests: Ambient Computing, Wireless Networking, Usable Security and Privacy, Cyber Physical Systems, IoT