🤖 AI Summary
Existing lip-reading methods suffer degraded accuracy on variable-speed videos due to the loss of fine-grained temporal information. To address this challenge in event-camera-based lip reading, we propose a multi-view temporal granularity alignment and aggregation framework. First, we introduce a novel time-segmented voxel graph list representation to explicitly model the local spatiotemporal structure of asynchronous event streams. Second, we design a granularity-aligned fusion module that enables cross-granularity collaboration between global spatial features from event frames and local spatiotemporal features from the voxel graph list, marking the first such integration. Third, we enhance absolute spatial encoding and long-range temporal modeling via positional encoding, synergistically combining graph neural networks with CNNs. Evaluated on mainstream event-based lip-reading benchmarks, our method significantly outperforms existing event-driven and frame-based approaches, demonstrating that fine-grained temporal modeling is critical for advancing lip-reading performance.
📝 Abstract
Lip-reading utilizes the visual information of a speaker's lip movements to recognize words and sentences. Existing event-based lip-reading solutions integrate branches with different frame rates to learn spatio-temporal features of varying granularities. However, aggregating events into event frames inevitably loses fine-grained temporal information within each frame. To remedy this drawback, we propose a novel framework termed Multi-view Temporal Granularity aligned Aggregation (MTGA). Specifically, we first present a novel event representation method, namely the time-segmented voxel graph list, in which the most significant local voxels are temporally connected into a graph list. We then design a spatio-temporal fusion module based on temporal granularity alignment, where the global spatial features extracted from event frames and the local relative spatial and temporal features contained in the voxel graph list are effectively aligned and integrated. Finally, we design a temporal aggregation module that incorporates positional encoding, enabling the capture of local absolute spatial and global temporal information. Experiments demonstrate that our method outperforms both event-based and video-based lip-reading counterparts.
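The time-segmented voxel graph list described above can be sketched roughly as follows: split the event stream into time segments, accumulate each segment into a coarse voxel grid, keep the most active voxels as graph nodes, and link nodes across consecutive segments. This is a minimal illustrative sketch, not the paper's exact construction; the grid resolution, top-k selection, and nearest-neighbor edge rule are all assumptions introduced here.

```python
import numpy as np

def voxel_graph_list(events, num_segments=4, grid=(8, 8, 4), top_k=16):
    """Sketch of a time-segmented voxel graph list.

    `events` is an (N, 4) array of (x, y, t, p) with x, y normalized to
    [0, 1) and p in {-1, +1}. All parameters are illustrative choices.
    """
    x, y, t, p = events.T
    t_norm = (t - t.min()) / (t.ptp() + 1e-9)  # map timestamps to [0, 1]
    seg = np.minimum((t_norm * num_segments).astype(int), num_segments - 1)

    nodes = []
    for s in range(num_segments):
        m = seg == s
        # local time within the segment, re-normalized to [0, 1)
        ts = t_norm[m] * num_segments - s
        # accumulate signed polarity into a coarse (H, W, B) voxel grid
        vox = np.zeros(grid)
        xi = np.minimum((x[m] * grid[0]).astype(int), grid[0] - 1)
        yi = np.minimum((y[m] * grid[1]).astype(int), grid[1] - 1)
        bi = np.minimum((ts * grid[2]).astype(int), grid[2] - 1)
        np.add.at(vox, (xi, yi, bi), p[m])
        # keep the top-k most active voxels as graph nodes
        keep = np.argsort(np.abs(vox).ravel())[-top_k:]
        coords = np.stack(np.unravel_index(keep, grid), axis=1)
        nodes.append((coords, vox.ravel()[keep]))

    # temporal edges: link each node to its nearest node (in voxel
    # coordinates) in the next segment, one edge set per segment pair
    edges = []
    for s in range(num_segments - 1):
        a, b = nodes[s][0], nodes[s + 1][0]
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        edges.append(np.stack([np.arange(len(a)), d.argmin(axis=1)], axis=1))
    return nodes, edges
```

The resulting node features could then feed a graph neural network, while the per-segment event frames feed a CNN branch, matching the paper's two-granularity design at a high level.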