🤖 AI Summary
Wearer speech recognition (WSR) on smart glasses degrades severely under interference from bystander speech (side-talk) in real-world environments, and these errors propagate to downstream NLP tasks. To address this, we propose a multi-channel differential automatic speech recognition (ASR) framework. Our method integrates beamforming, dynamic microphone selection, and a lightweight side-talk detection model to generate robust differential signals at the front end. Moreover, we are the first to embed the differential mechanism into the end-to-end ASR joint optimization pipeline, enabling jointly optimized interference suppression and speech recognition. Evaluated on both simulated and real-world datasets, our framework achieves up to an 18.0% relative reduction in word error rate (WER), significantly enhancing the stability and practicality of WSR under challenging acoustic conditions.
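To make the front-end pipeline concrete, here is a minimal numpy sketch of how the three components could produce differential inputs for the ASR model. The delay-and-sum beamformer, energy-based microphone selection, and energy-ratio gate are illustrative stand-ins, not the paper's actual beamformer, selection criterion, or learned side-talk detector, and names such as `differential_frontend` are hypothetical.

```python
import numpy as np

def delay_and_sum(mics: np.ndarray, delays_s: np.ndarray, sr: int) -> np.ndarray:
    """Toy delay-and-sum beamformer: shift each channel by an integer sample
    delay toward the wearer's mouth, then average across channels."""
    aligned = [np.roll(ch, -int(round(d * sr))) for ch, d in zip(mics, delays_s)]
    return np.mean(aligned, axis=0)

def select_interference_mic(mics: np.ndarray) -> np.ndarray:
    """Toy dynamic microphone selection: treat the highest-energy channel as
    the one most exposed to side-talk (a stand-in criterion)."""
    return mics[np.argmax(np.sum(mics ** 2, axis=1))]

def side_talk_gate(wearer: np.ndarray, interferer: np.ndarray,
                   frame: int = 400) -> np.ndarray:
    """Stand-in for the lightweight side-talk detector: a per-frame energy
    ratio used as a soft gate (the real detector is a learned model)."""
    n = min(len(wearer), len(interferer)) // frame * frame
    w = wearer[:n].reshape(-1, frame)
    i = interferer[:n].reshape(-1, frame)
    ratio = np.sum(i ** 2, axis=1) / (np.sum(w ** 2, axis=1) + 1e-8)
    return 1.0 / (1.0 + ratio)  # ~1 when the wearer dominates, ~0 otherwise

def differential_frontend(mics: np.ndarray, delays_s: np.ndarray, sr: int):
    """Combine the three front ends into differential inputs for the ASR model."""
    wearer_beam = delay_and_sum(mics, delays_s, sr)
    interferer = select_interference_mic(mics)
    # Differential signal: wearer-focused beam minus the interference-dominated
    # channel, so bystander energy partially cancels.
    diff = wearer_beam - interferer
    gate = side_talk_gate(wearer_beam, interferer)
    # An end-to-end ASR model would consume these streams and be optimized
    # jointly with the front end.
    return wearer_beam, diff, gate

if __name__ == "__main__":
    sr = 16000
    mics = np.random.randn(4, sr)               # 4 channels, 1 s of audio
    delays = np.array([0.0, 1e-4, 2e-4, 3e-4])  # hypothetical steering delays
    beam, diff, gate = differential_frontend(mics, delays, sr)
    print(beam.shape, diff.shape, gate.shape)
```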
📝 Abstract
With the growing adoption of wearable devices such as smart glasses for AI assistants, wearer speech recognition (WSR) is becoming increasingly critical to next-generation human-computer interfaces. However, in real environments, interference from side-talk speech remains a significant challenge to WSR and can cause errors to accumulate in downstream tasks such as natural language processing. In this work, we introduce a novel multi-channel differential automatic speech recognition (ASR) method for robust WSR on smart glasses. The proposed system takes differential inputs from complementary frontends, including a beamformer, microphone selection, and a lightweight side-talk detection model, to improve the robustness of WSR. Evaluations on both simulated and real datasets demonstrate that the proposed system outperforms the traditional approach, achieving up to an 18.0% relative reduction in word error rate.
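For reference, relative word error rate reduction is computed as (baseline − proposed) / baseline. The short sketch below uses hypothetical WER values chosen only to illustrate what an 18.0% relative reduction means; they are not numbers reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Relative reduction in word error rate, expressed as a fraction."""
    return (baseline_wer - proposed_wer) / baseline_wer

# Hypothetical illustration: dropping from 10.0% to 8.2% WER
# corresponds to an 18.0% relative reduction.
print(f"{relative_wer_reduction(10.0, 8.2):.1%}")  # -> 18.0%
```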