Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation

📅 2025-01-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited accuracy and efficiency of event-RGB optical flow estimation under low-light and high-speed motion conditions, this paper proposes a spatially guided cross-modal temporal aggregation method. The core innovation is a fusion paradigm, introduced here for the first time, in which spatially dense RGB frames guide the temporal aggregation of temporally dense event streams. The method combines an event-enhanced frame representation, a transformer-based feature completion module, and a mix-fusion encoder to jointly exploit the stable texture of frames and the high temporal resolution of events. The resulting end-to-end network achieves state-of-the-art performance on DSEC-Flow: it improves accuracy by 10% over event-only methods and by 4% over the best existing fusion approach, while reducing inference time by 45%.
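The guided-aggregation idea lends itself to a compact sketch. Below is a minimal PyTorch illustration of one plausible reading: motion features from the guiding frame modality produce per-time-bin attention weights that collapse a temporally dense stack of event features into a single motion feature map. The module name `GuidedTemporalAggregation`, the 1x1-conv gating, the softmax over bins, and all shapes are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of spatially-guided temporal aggregation: frame features
# gate how much each event time bin contributes at every pixel.
# All design choices here are assumptions made for illustration.
import torch
import torch.nn as nn


class GuidedTemporalAggregation(nn.Module):
    def __init__(self, channels: int, num_bins: int):
        super().__init__()
        # Projects the guiding frame features into one logit per temporal bin.
        self.gate = nn.Conv2d(channels, num_bins, kernel_size=1)

    def forward(self, frame_feat: torch.Tensor, event_feats: torch.Tensor) -> torch.Tensor:
        # frame_feat:  (B, C, H, W)    spatially dense guiding features
        # event_feats: (B, T, C, H, W) per-bin event motion features
        weights = self.gate(frame_feat).softmax(dim=1)  # (B, T, H, W)
        weights = weights.unsqueeze(2)                  # (B, T, 1, H, W)
        # Weighted sum over time bins yields one aggregated motion map.
        return (weights * event_feats).sum(dim=1)       # (B, C, H, W)


if __name__ == "__main__":
    agg = GuidedTemporalAggregation(channels=64, num_bins=5)
    frame = torch.randn(2, 64, 32, 32)
    events = torch.randn(2, 5, 64, 32, 32)
    print(agg(frame, events).shape)  # torch.Size([2, 64, 32, 32])
```

The softmax gating makes the aggregation a convex combination per pixel, which is one simple way the stable frame appearance could steer which event time bins dominate; the paper may realize this differently.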

📝 Abstract
Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time. Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios. These complementary characteristics underscore the potential of integrating frame and event data for optical flow estimation. However, most cross-modal approaches fail to fully utilize the complementary advantages, relying instead on simply stacking information. This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality, achieving effective cross-modal fusion. Specifically, we propose an event-enhanced frame representation that preserves the rich texture of frames and the basic structure of events. We use the enhanced representation as the guiding modality and employ events to capture temporally dense motion information. The robust motion features derived from the guiding modality direct the aggregation of motion information from events. To further enhance fusion, we propose a transformer-based module that complements sparse event motion features with spatially rich frame information and enhances global information propagation. Additionally, a mix-fusion encoder is designed to extract comprehensive spatiotemporal contextual features from both modalities. Extensive experiments on the MVSEC and DSEC-Flow datasets demonstrate the effectiveness of our framework. Leveraging the complementary strengths of frames and events, our method achieves leading performance on the DSEC-Flow dataset. Compared to the event-only model, frame guidance improves accuracy by 10%. Furthermore, it outperforms the state-of-the-art fusion-based method with a 4% accuracy gain and a 45% reduction in inference time.
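The abstract's transformer-based completion step also admits a short sketch: sparse event motion features act as queries that attend to spatially rich frame features, filling in regions where events are absent. The use of `nn.MultiheadAttention`, the residual connection, and the flattened-token shapes below are my assumptions for illustration, not the paper's exact module.

```python
# Hedged sketch of cross-attention feature completion: event tokens (queries)
# borrow spatial detail from frame tokens (keys/values). Assumed design.
import torch
import torch.nn as nn


class FeatureCompletion(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, event_feat: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W). Flatten spatial dims into token sequences.
        b, c, h, w = event_feat.shape
        q = event_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) event queries
        kv = frame_feat.flatten(2).transpose(1, 2)  # (B, H*W, C) frame keys/values
        out, _ = self.attn(q, kv, kv)               # global cross-modal propagation
        out = self.norm(q + out)                    # residual keeps original event cues
        return out.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    fc = FeatureCompletion(dim=64)
    ev = torch.randn(1, 64, 16, 16)
    fr = torch.randn(1, 64, 16, 16)
    print(fc(ev, fr).shape)  # torch.Size([1, 64, 16, 16])
```

Treating every spatial location as a token gives the global information propagation the abstract mentions, at the cost of quadratic attention over H*W tokens; a real implementation would likely operate at a downsampled feature resolution.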
Problem

Research questions and friction points this paper is trying to address.

Optical Flow Estimation
Event-based RGB Fusion
Accuracy and Efficiency Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense Image-Event Fusion
RGB Optical Flow Estimation
Efficient Motion Capture