Can We Hear from Events? Generating Speech from Event Camera

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional RGB video-based speech generation: frame-based cameras have limited temporal resolution and fixed exposure times, so they blur the high-frequency articulatory transients that carry emotional cues. The authors propose EventSpeech, a text-conditioned framework that synthesizes emotional speech from microsecond-resolution neuromorphic event streams. The architecture combines an Event Encoder for sparse events with a multi-scale Audio Encoder built around a Hierarchical Wavelet Contextualizer (HWC), and uses a bidirectional alignment mechanism to synchronize linguistic content and visual dynamics with acoustic features. The authors also introduce EVT-SPK, which they describe as the first neuromorphic speech benchmark. They report that EventSpeech outperforms existing baselines in preserving emotional nuance and in robustness to motion blur.

📝 Abstract

Traditional RGB-based speech generation faces a Temporal Granularity Mismatch: fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech, a novel text-conditioned framework that pioneers the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder, which models sparse neuromorphic events, with a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK, the first benchmark of its kind, comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines, preserving fine-grained emotions and resisting motion blur, and establishes a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.
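The Temporal Granularity Mismatch described in the abstract can be illustrated with a small sketch. This is not the EventSpeech Event Encoder. It is a toy example under assumed conventions: the event tuple layout `(timestamp_us, x, y, polarity)`, the function name `bin_events`, the 30 fps frame rate, and the 1 ms bin width are all hypothetical choices made for illustration. It only shows that events with microsecond timestamps keep their ordering when binned finely, while a single frame-length window merges them.

```python
from collections import Counter

# Hypothetical event record: (timestamp_us, x, y, polarity).
# This layout is an assumption for illustration, not the paper's data format.
Event = tuple


def bin_events(events, bin_us):
    """Count events per time bin of width bin_us microseconds."""
    if bin_us <= 0:
        raise ValueError("bin_us must be positive")
    counts = Counter()
    for t_us, _x, _y, _polarity in events:
        counts[t_us // bin_us] += 1
    return dict(counts)


# Five toy events spread over the duration of one 30 fps video frame.
events = [
    (100, 0, 0, 1),
    (900, 1, 0, -1),
    (1_500, 0, 1, 1),
    (20_000, 2, 2, 1),
    (33_000, 0, 0, 1),
]

FRAME_US = 1_000_000 // 30  # one frame period at 30 fps, about 33 ms
FINE_US = 1_000             # an assumed 1 ms bin

# At frame granularity every event falls into the same bin.
print(bin_events(events, FRAME_US))  # {0: 5}

# At 1 ms granularity the temporal structure is still visible.
print(bin_events(events, FINE_US))   # {0: 2, 1: 1, 20: 1, 33: 1}
```

At frame granularity the five events are indistinguishable in time, which is the blurring effect the abstract attributes to fixed exposure. A finer bin preserves when each change occurred, which is the property the abstract relies on when it says events align with acoustic waveform dynamics.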

Problem

Research questions and friction points this paper is trying to address.

Temporal Granularity Mismatch

emotional speech

high-frequency articulatory transients

motion blur

speech generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

event camera

neuromorphic events

speech generation