Can We Hear from Events? Generating Speech from Event Camera

📅 2026-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional RGB video-based speech generation, which suffers from insufficient temporal resolution and fixed exposure settings that hinder the capture of high-frequency articulatory transients, leading to ambiguous emotional expression. To overcome this, the authors propose EventSpeech, a novel framework that leverages microsecond-resolution neuromorphic event streams for text-conditioned emotional speech synthesis. The approach integrates an event encoder, a multi-scale audio encoder, and a hierarchical wavelet context module, complemented by a bidirectional alignment mechanism to synchronize linguistic content, visual dynamics, and acoustic features. The study introduces EVT-SPK, the first neuromorphic speech benchmark, demonstrating superior performance over existing methods in preserving emotional nuance and mitigating motion blur, thereby establishing a new paradigm for event-driven multimodal speech generation.
📝 Abstract
Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.
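The Temporal Granularity Mismatch described in the abstract can be illustrated with a toy calculation. This is a minimal sketch and not EventSpeech's Event Encoder: the 25 fps frame rate, the 1 ms bin width, the synthetic 4 ms burst, and the helper `bin_events` are all assumptions chosen for illustration.

```python
from collections import Counter


def bin_events(timestamps_us, bin_us):
    """Count events per fixed-width time bin (timestamps in microseconds).

    Hypothetical helper for illustration; not part of any published code.
    """
    counts = Counter(t // bin_us for t in timestamps_us)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]


# Assumed synthetic burst: a 4 ms articulatory transient starting at
# t = 10 ms, emitting one event every 100 microseconds.
burst = list(range(10_000, 14_000, 100))

# One 40 ms bin corresponds to a single frame interval at an assumed 25 fps.
frame_bins = bin_events(burst, 40_000)

# 1 ms bins keep the onset and duration of the burst visible.
fine_bins = bin_events(burst, 1_000)

print(frame_bins)
print(fine_bins)
```

At the frame-sized bin width the whole burst collapses into a single count, whereas the 1 ms bins still show when the transient starts and how long it lasts. That difference is the intuition behind pairing microsecond-timestamped events with acoustic features.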
Problem

Research questions and friction points this paper is trying to address.

Temporal Granularity Mismatch
emotional speech
high-frequency articulatory transients
motion blur
speech generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

event camera
neuromorphic events
speech generation
temporal granularity
multimodal alignment