AI Summary
This work addresses the limitations of conventional RGB video-based speech generation: insufficient temporal resolution and fixed exposure settings prevent the capture of high-frequency articulatory transients, which leads to ambiguous emotional expression. To overcome these limitations, the authors propose EventSpeech, a framework that leverages microsecond-resolution neuromorphic event streams for text-conditioned emotional speech synthesis. The architecture combines an event encoder with a multi-scale audio encoder that includes a Hierarchical Wavelet Contextualizer (HWC), and a bidirectional alignment mechanism synchronizes linguistic content and visual dynamics with acoustic features. The study also introduces EVT-SPK, described as the first neuromorphic speech benchmark, and reports that EventSpeech outperforms existing methods in preserving emotional nuance and resisting motion blur, which the authors position as a new paradigm for event-driven multimodal speech generation.
Abstract
Traditional RGB-based speech generation faces a Temporal Granularity Mismatch: fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech, a novel text-conditioned framework that pioneers the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder, which models sparse neuromorphic events, with a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK, the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines, preserving fine-grained emotions and resisting motion blur, and establishes a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.
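To make the temporal-granularity argument concrete, the sketch below bins a synthetic event stream into fine temporal slices and compares the result with a single frame-length window. It is illustrative only and is not EventSpeech's Event Encoder: the event layout (timestamp in microseconds, x, y, polarity), the helper name `events_to_voxel_grid`, the roughly 33 ms window, and the bin counts are all assumptions that do not come from the paper.

```python
import numpy as np

def events_to_voxel_grid(t_us, x, y, polarity, num_bins, height, width):
    """Accumulate signed event polarities into a (num_bins, height, width) grid.

    Hypothetical helper for illustration; not the paper's Event Encoder.
    t_us: integer timestamps in microseconds; polarity: +1 or -1 per event.
    """
    t_us = np.asarray(t_us, dtype=np.int64)
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if t_us.size == 0:
        return grid
    t0, t1 = t_us.min(), t_us.max()
    span = max(int(t1 - t0), 1)  # guard against a zero-length time span
    # Map each timestamp to a temporal bin index in [0, num_bins - 1].
    bins = np.minimum((t_us - t0) * num_bins // span, num_bins - 1)
    # np.add.at accumulates correctly even when indices repeat.
    np.add.at(
        grid,
        (bins, np.asarray(y), np.asarray(x)),
        np.asarray(polarity, dtype=np.float32),
    )
    return grid

# Synthetic stream (assumed): one pixel toggling ON/OFF every 1 ms over ~33 ms,
# roughly one frame period of a 30 fps camera.
t_us = np.arange(0, 33_000, 1_000)
x = np.zeros_like(t_us)
y = np.zeros_like(t_us)
polarity = np.where((t_us // 1_000) % 2 == 0, 1, -1)

# Fine binning keeps each transient in its own slice.
fine = events_to_voxel_grid(t_us, x, y, polarity, num_bins=33, height=1, width=1)
# A single bin integrates the whole window, so opposite polarities cancel.
coarse = events_to_voxel_grid(t_us, x, y, polarity, num_bins=1, height=1, width=1)
```

In this toy setting, `fine` retains every alternating polarity as a separate slice, while `coarse` collapses the window to a single net value. This is loosely analogous to how integrating over a fixed exposure discards rapid transients, although signed event accumulation is not a model of RGB motion blur.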