MTGA: Multi-view Temporal Granularity aligned Aggregation for Event-based Lip-reading

πŸ“… 2024-04-18
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing lip-reading methods suffer degraded accuracy on variable-speed videos because aggregating asynchronous events into frames loses fine-grained temporal information within each frame. To address this in event-camera-based lip reading, the paper proposes MTGA, a Multi-view Temporal Granularity aligned Aggregation framework. First, it introduces a novel time-segmented voxel graph list representation, in which the most significant local voxels are temporally connected into a graph list that explicitly models the local spatiotemporal structure of the event stream. Second, a spatio-temporal fusion module based on temporal granularity alignment integrates global spatial features extracted from event frames with the local relative spatial and temporal features carried by the voxel graph list. Third, a temporal aggregation module with positional encoding captures local absolute spatial and global temporal information. On event-based lip-reading benchmarks, the method outperforms both event-based and video-based counterparts, indicating that fine-grained temporal modeling is important for lip-reading performance.

πŸ“ Abstract
Lip-reading utilizes the visual information of the speaker's lip movements to recognize words and sentences. Existing event-based lip-reading solutions integrate different frame rate branches to learn spatio-temporal features of varying granularities. However, aggregating events into event frames inevitably leads to the loss of fine-grained temporal information within frames. To remedy this drawback, we propose a novel framework termed Multi-view Temporal Granularity aligned Aggregation (MTGA). Specifically, we first present a novel event representation method, namely the time-segmented voxel graph list, where the most significant local voxels are temporally connected into a graph list. Then we design a spatio-temporal fusion module based on temporal granularity alignment, where the global spatial features extracted from event frames, together with the local relative spatial and temporal features contained in the voxel graph list, are effectively aligned and integrated. Finally, we design a temporal aggregation module that incorporates positional encoding, which enables the capture of local absolute spatial and global temporal information. Experiments demonstrate that our method outperforms both the event-based and video-based lip-reading counterparts.
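The abstract's starting point is a time-segmented event representation: the asynchronous event stream is split into equal-duration segments, and each segment is accumulated into a voxel grid before the most significant voxels are linked into a graph list. The paper does not publish its construction code here, so the following is a minimal sketch of only the first step (segmenting and voxelizing events); the function name, array layout `(t, x, y, polarity)`, and parameters are illustrative assumptions, not the authors' API:

```python
import numpy as np

def events_to_voxel_segments(events, num_segments=4, bins_per_segment=5,
                             height=64, width=64):
    """Split an event stream into equal-duration time segments and
    accumulate each segment into a (bins, H, W) voxel grid.

    events: (N, 4) array of (timestamp, x, y, polarity) rows.
    Returns a list of num_segments voxel grids (hypothetical layout).
    """
    t = events[:, 0]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)  # scale to [0, 1]
    grids = []
    for s in range(num_segments):
        lo, hi = s / num_segments, (s + 1) / num_segments
        # last segment closes the interval so t_norm == 1.0 is included
        mask = (t_norm >= lo) & ((t_norm < hi) | (s == num_segments - 1))
        seg = events[mask]
        grid = np.zeros((bins_per_segment, height, width), dtype=np.float32)
        if len(seg):
            # map each event's timestamp to a temporal bin inside the segment
            local = (t_norm[mask] - lo) * num_segments  # in [0, 1]
            b = np.clip((local * bins_per_segment).astype(int),
                        0, bins_per_segment - 1)
            x = seg[:, 1].astype(int)
            y = seg[:, 2].astype(int)
            pol = np.where(seg[:, 3] > 0, 1.0, -1.0)  # signed accumulation
            np.add.at(grid, (b, y, x), pol)
        grids.append(grid)
    return grids
```

In the paper's pipeline, a further step would select the most significant voxels from each grid and connect them across segments into the graph list consumed by the graph branch; that selection criterion is not specified in this summary, so it is omitted here.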
Problem

Research questions and friction points this paper is trying to address.

Lip-reading
Variable Video Speeds
Accuracy Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

MTGA
Multi-view Temporal Alignment
Lip-reading Accuracy
Wenhao Zhang
School of Computer Science, Wuhan University, Wuhan, China
Jun Wang
School of Computer Science, Wuhan University, Wuhan, China
Yong Luo
Wuhan University
Artificial Intelligence, Machine Learning, Data Mining, Pattern Classification and Search
Lei Yu
School of Electronic Information, Wuhan University, Wuhan, China
Wei Yu
School of Computer Science, Wuhan University, Wuhan, China
Zheng He
University of British Columbia
deep learning, machine learning