🤖 AI Summary
Ear-worn devices suffer from limited speech enhancement performance in noisy environments, primarily because omnidirectional microphones alone struggle to suppress competing speakers and other non-stationary noise. This paper introduces VibOmni: a lightweight, end-to-end multimodal speech enhancement system that pioneers the deep fusion of bone-conducted vibration signals, captured via inertial sensors, with acoustic inputs. Its key contributions are: (1) a data augmentation method that models the bone conduction transfer function from limited training data to synthesize high-fidelity vibration signals; (2) a backpropagation-free multimodal signal-to-noise ratio (SNR) estimator enabling on-device continual learning and adaptive inference; and (3) a dual-branch encoder-decoder architecture optimized for low-latency edge deployment. Evaluated on a real-world dataset from 32 subjects, VibOmni achieves up to a 21% improvement in PESQ, a 26% improvement in SNR, and roughly a 40% reduction in word error rate (WER) over baselines. In a user study with 35 participants, 87% preferred VibOmni over the baselines.
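The BCF modeling itself is only summarized above, but the core idea of transfer-function-based augmentation can be sketched. The Python sketch below assumes a simple per-frequency linear BCF estimated Wiener-style from a few time-aligned paired recordings; the sampling rate, frame size, and estimator are illustrative assumptions, not VibOmni's published method.

```python
# Minimal sketch of BCF-style data augmentation (assumptions, not the
# paper's exact model): the bone conduction function is treated as a
# per-frequency linear transfer function, estimated from a few paired,
# time-aligned (audio, vibration) recordings resampled to a common rate.
import numpy as np
from scipy.signal import stft, istft

def estimate_bcf(audio, vib, fs=4000, nperseg=256):
    """Wiener-style estimate: cross-spectrum / audio auto-spectrum,
    averaged over STFT frames of one paired recording."""
    _, _, A = stft(audio, fs=fs, nperseg=nperseg)
    _, _, V = stft(vib, fs=fs, nperseg=nperseg)
    n = min(A.shape[1], V.shape[1])
    A, V = A[:, :n], V[:, :n]
    num = (V * np.conj(A)).mean(axis=1)         # cross power spectrum
    den = (np.abs(A) ** 2).mean(axis=1) + 1e-8  # audio power spectrum
    return num / den                            # complex BCF per frequency bin

def synthesize_vibration(audio, bcf, fs=4000, nperseg=256):
    """Apply an estimated BCF to unpaired speech to synthesize an
    IMU-like vibration signal for training-data augmentation."""
    _, _, A = stft(audio, fs=fs, nperseg=nperseg)
    _, v = istft(A * bcf[:, None], fs=fs, nperseg=nperseg)
    return v
```

A BCF estimated this way could be reused across utterances from the same wearer, which is what makes this style of augmentation attractive for turning large audio-only corpora into synthetic paired data.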
📝 Abstract
Earables, such as True Wireless Stereo earphones and VR/AR headsets, are increasingly popular, yet their compact design poses challenges for robust voice-related applications like telecommunication and voice-assistant interaction in noisy environments. Existing speech enhancement systems, reliant solely on omnidirectional microphones, struggle with ambient noise such as competing speakers. To address these issues, we propose VibOmni, a lightweight, end-to-end multi-modal speech enhancement system for earables that leverages bone-conducted vibrations captured by widely available Inertial Measurement Units (IMUs). VibOmni integrates a two-branch encoder-decoder deep neural network to fuse audio and vibration features. To overcome the scarcity of paired audio-vibration datasets, we introduce a novel data augmentation technique that models Bone Conduction Functions (BCFs) from limited recordings, enabling synthetic vibration data generation with only 4.5% spectrogram similarity error. Additionally, a multi-modal SNR estimator facilitates continual learning and adaptive inference, optimizing performance in dynamic, noisy settings without on-device back-propagation. Evaluated on real-world datasets from 32 volunteers with different devices, VibOmni achieves up to a 21% improvement in Perceptual Evaluation of Speech Quality (PESQ), a 26% improvement in Signal-to-Noise Ratio (SNR), and about a 40% reduction in Word Error Rate (WER), with substantially lower latency on mobile devices. A user study with 35 participants showed that 87% preferred VibOmni over baselines, demonstrating its effectiveness for deployment in diverse acoustic environments.
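To make the two-branch design concrete, here is a minimal PyTorch sketch of one way such a fusion network could be wired. The GRU encoders, concatenation-based fusion, and mask-style output are our illustrative assumptions; the paper's actual layers, sizes, and feature shapes may differ, and we assume the vibration features have already been projected to the same frame rate and bin count as the audio.

```python
# Minimal sketch of a two-branch encoder-decoder for audio/vibration
# fusion. Architecture details are assumptions for illustration only.
import torch
import torch.nn as nn

class DualBranchEnhancer(nn.Module):
    def __init__(self, n_bins=257, hidden=128):
        super().__init__()
        self.audio_enc = nn.GRU(n_bins, hidden, batch_first=True)
        self.vib_enc = nn.GRU(n_bins, hidden, batch_first=True)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, audio_spec, vib_spec):
        # Both inputs: (batch, frames, n_bins) magnitude spectrograms,
        # assumed time-aligned frame by frame.
        a, _ = self.audio_enc(audio_spec)
        v, _ = self.vib_enc(vib_spec)
        fused, _ = self.decoder(torch.cat([a, v], dim=-1))
        return self.mask(fused) * audio_spec   # masked clean-speech estimate

# Usage: model = DualBranchEnhancer()
#        est = model(torch.rand(1, 100, 257), torch.rand(1, 100, 257))
```

Unidirectional recurrent layers are one natural fit for the low-latency, frame-by-frame inference the paper targets, since they require no look-ahead.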
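The back-propagation-free SNR estimator can likewise be motivated from first principles: bone-conducted vibrations are largely immune to airborne noise, so the IMU channel can act as a voice-activity reference against which microphone energy is compared. The sketch below illustrates that reading; the thresholding, framing, and energy-ratio formulation are assumptions, not the paper's estimator.

```python
# Minimal sketch of a gradient-free, multi-modal SNR estimate
# (assumptions, not the paper's method): the vibration channel flags
# speech-active frames, and mic energy in active vs. inactive frames
# yields an SNR estimate cheap enough for on-device adaptive inference.
import numpy as np

def estimate_snr_db(mic_frames, vib_frames, vad_ratio=0.1):
    """mic_frames, vib_frames: (n_frames, frame_len), time-aligned."""
    vib_energy = (vib_frames ** 2).mean(axis=1)
    speech = vib_energy > vad_ratio * vib_energy.max()  # IMU-based VAD
    if not speech.any() or speech.all():
        return None  # window has no usable speech/noise separation
    mic_energy = (mic_frames ** 2).mean(axis=1)
    noise_pow = mic_energy[~speech].mean() + 1e-12
    speech_pow = max(mic_energy[speech].mean() - noise_pow, 1e-12)
    return 10.0 * np.log10(speech_pow / noise_pow)
```

Because nothing here requires gradients, an estimator of this kind could gate when to run heavier enhancement or when to collect pseudo-labels for continual learning, consistent with the adaptive-inference role described above.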