EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling

📅 2025-04-03

📈 Citations: 0

✨ Influential: 0

career value

221K/year

🤖 AI Summary

Existing event-camera-based vibration recovery methods suffer from limited sound reconstruction fidelity, failing to fully exploit spatiotemporal information embedded in event streams. To address this, we propose the first spatiotemporal joint modeling framework for contactless acoustic reconstruction from events. Our approach comprises: (i) a laser-speckle-enhanced imaging system to boost micro-vibration signal-to-noise ratio; (ii) a dedicated neural network integrating sparse event representation, spatial aggregation, and Mamba-based long-range temporal modeling; and (iii) a physics-informed data synthesis pipeline for realistic event-stream generation. Evaluated on both synthetic and real-world benchmarks, our method significantly outperforms state-of-the-art approaches, enabling high-fidelity reconstruction of speech and music signals—with an average SNR improvement of 9.2 dB. It breaks the classical optical-acoustic trade-off among sampling rate, bandwidth, and field of view.

Technology Category

Application Category

📝 Abstract

When sound waves hit an object, they induce vibrations that produce high-frequency and subtle visual changes, which can be used for recovering the sound. Early studies always encounter trade-offs related to sampling rate, bandwidth, field of view, and the simplicity of the optical path. Recent advances in event camera hardware show good potential for its application in visual sound recovery, because of its superior ability in capturing high-frequency signals. However, existing event-based vibration recovery methods are still sub-optimal for sound recovery. In this work, we propose a novel pipeline for non-contact sound recovery, fully utilizing spatial-temporal information from the event stream. We first generate a large training set using a novel simulation pipeline. Then we designed a network that leverages the sparsity of events to capture spatial information and uses Mamba to model long-term temporal information. Lastly, we train a spatial aggregation block to aggregate information from different locations to further improve signal quality. To capture event signals caused by sound waves, we also designed an imaging system using a laser matrix to enhance the gradient and collected multiple data sequences for testing. Experimental results on synthetic and real-world data demonstrate the effectiveness of our method.

Problem

Research questions and friction points this paper is trying to address.

Recovering sound from high-frequency visual vibrations using event cameras

Overcoming trade-offs in sampling rate, bandwidth, and optical simplicity

Enhancing spatial-temporal modeling for accurate non-contact sound recovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Event camera captures high-frequency sound vibrations

Spatial-temporal network with Mamba for long-term modeling

Laser matrix imaging enhances gradient for signal quality

🔎 Similar Papers

No similar papers found.