🤖 AI Summary
Directional voice capture on smartphones remains challenging in noisy, reverberant environments. Method: This paper proposes a lightweight, end-to-end, real-time solution comprising (i) a bio-inspired passive acoustic microstructure for direction encoding, which enables accurate sound-source localization using only the two microphones in standard wired earphone controls, and (ii) a mobile-optimized neural network that decouples source separation from spatial focusing. Contribution/Results: The work introduces the first smartphone-compatible passive acoustic microstructure; it requires no additional hardware or power supply, yet with just two microphones it outperforms conventional five-element microphone arrays. Experiments demonstrate a 5.0 dB SNR improvement within a 30° steering region and real-time inference (<40 ms latency) on commercial devices including the iPhone, significantly improving far-field speech intelligibility.
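To make the decoupled design more concrete, the sketch below shows one plausible shape an angle-conditioned, two-microphone extraction network could take. The class name, the conditioning scheme (adding an angle embedding to encoder features), and all layer sizes are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of an angle-conditioned two-microphone speech extractor.
# Everything here (TwoMicExtractor, layer sizes, conditioning) is an assumption
# for illustration; it is not the architecture described in the paper.
import torch
import torch.nn as nn

class TwoMicExtractor(nn.Module):
    def __init__(self, feat=64, kernel=16, stride=8):
        super().__init__()
        # Learned encoder over the raw 2-channel mixture from the earphone mics.
        self.encoder = nn.Conv1d(2, feat, kernel_size=kernel, stride=stride)
        # Map the target steering angle to a conditioning vector.
        self.angle_fc = nn.Linear(2, feat)  # input: [cos(theta), sin(theta)]
        # Lightweight recurrent separator suitable for streaming inference.
        self.separator = nn.GRU(feat, feat, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(feat, feat), nn.Sigmoid())
        # Decoder back to a single-channel waveform of the target speech.
        self.decoder = nn.ConvTranspose1d(feat, 1, kernel_size=kernel, stride=stride)

    def forward(self, mixture, angle_rad):
        # mixture: (batch, 2, samples); angle_rad: (batch,) target direction
        z = torch.relu(self.encoder(mixture))                      # (B, F, T)
        cond = self.angle_fc(torch.stack([angle_rad.cos(),
                                          angle_rad.sin()], dim=-1))
        z = z + cond.unsqueeze(-1)                                 # inject direction
        h, _ = self.separator(z.transpose(1, 2))                   # (B, T, F)
        m = self.mask(h).transpose(1, 2)                           # (B, F, T)
        return self.decoder(z * m)                                 # (B, 1, samples)

if __name__ == "__main__":
    model = TwoMicExtractor()
    mix = torch.randn(1, 2, 16000)              # 1 s of 16 kHz two-channel audio
    out = model(mix, torch.tensor([0.5]))       # steer toward ~0.5 rad (~30 deg)
    print(out.shape)                            # torch.Size([1, 1, 16000])
```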
📝 Abstract
Imagine placing your smartphone on a table in a noisy restaurant and clearly capturing the voices of friends seated around you, or recording a lecturer's voice with clarity in a reverberant auditorium. We introduce SonicSieve, the first intelligent directional speech extraction system for smartphones using a bio-inspired acoustic microstructure. Our passive design embeds directional cues onto incoming speech without any additional electronics. It attaches to the in-line mic of low-cost wired earphones that plug into smartphones. We present an end-to-end neural network that processes the raw audio mixtures in real time on mobile devices. Our results show that SonicSieve achieves a signal quality improvement of 5.0 dB when focusing on a 30° angular region. Additionally, our two-microphone system outperforms conventional 5-microphone arrays.
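For context on how a figure like the reported 5.0 dB is typically computed, the snippet below sketches a scale-invariant SNR improvement: the SI-SNR of the extracted output minus the SI-SNR of the raw mixture. The exact metric used in the paper may differ, so treat this as an assumed evaluation recipe rather than the authors' protocol.

```python
# Assumed evaluation recipe: scale-invariant SNR improvement in dB.
# The synthetic signals at the bottom are hypothetical, for illustration only.
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) of an estimate against a reference signal."""
    ref_zm = ref - ref.mean()
    est_zm = est - est.mean()
    # Project the estimate onto the reference to split target from residual noise.
    proj = (est_zm @ ref_zm) / (ref_zm @ ref_zm + eps) * ref_zm
    noise = est_zm - proj
    return 10 * np.log10((proj @ proj) / (noise @ noise + eps) + eps)

def snr_improvement(mixture_ch, estimate, target):
    """Improvement = SI-SNR of the extracted output minus SI-SNR of the mixture."""
    return si_snr(estimate, target) - si_snr(mixture_ch, target)

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
mixture = target + 0.8 * rng.standard_normal(16000)   # noisy single channel
estimate = target + 0.2 * rng.standard_normal(16000)  # cleaner extracted output
print(f"improvement: {snr_improvement(mixture, estimate, target):.1f} dB")
```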