Temporal-Guided Visual Foundation Models for Event-Based Vision

📅 2025-11-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Event cameras offer advantages in challenging environments, but their asynchronous event streams are hard to model, and existing approaches make little use of pre-trained Visual Foundation Models (VFMs). This paper proposes Temporal-Guided VFM (TGVFM), a framework that transfers image-pretrained VFMs to event-based vision. Events are first converted to frames by event-to-video models retrained on real-world data; the frames are then processed by a transformer-based VFM augmented with a temporal context fusion block. That block combines long-range temporal attention, dual spatiotemporal attention, and a deep feature guidance mechanism to fuse semantic and temporal features. On semantic segmentation, depth estimation, and object detection, the paper reports improvements of 16%, 21%, and 16% over existing methods, respectively, which the authors describe as state-of-the-art.

📝 Abstract

Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under-explored for event-based vision. To address this, we propose Temporal-Guided VFM (TGVFM), a novel framework that seamlessly integrates VFMs with our temporal context fusion block. The temporal block introduces three key components: (1) Long-Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi-scale frame correlation, and (3) a Deep Feature Guidance Mechanism to fuse semantic and temporal features. By retraining event-to-video models on real-world data and leveraging transformer-based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate state-of-the-art performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross-modality potential of image-based VFMs for event-based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.
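To make the idea of attending across time more concrete, here is a minimal, dependency-free sketch of generic scaled dot-product attention applied along the frame axis for a single spatial location. It is not the TGVFM implementation: the function names `softmax` and `temporal_attention` are hypothetical, the query/key/value projections are left as identity (a real block would learn them), and the abstract does not specify how Long-Range Temporal Attention is actually built.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames):
    """Mix the feature vectors of one spatial location across T frames.

    frames: list of T feature vectors (lists of floats, all the same length).
    Returns T fused vectors; each is a softmax-weighted average of all frames.
    Assumption: identity Q/K/V projections, used here only for illustration.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in frames:
        # Similarity of this frame to every frame in the window.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in frames]
        weights = softmax(scores)
        # Weighted average of the frame features along the time axis.
        fused.append(
            [sum(w * value[d] for w, value in zip(weights, frames)) for d in range(dim)]
        )
    return fused


if __name__ == "__main__":
    # Three frames, two feature channels each (toy values).
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for vector in temporal_attention(features):
        print([round(x, 3) for x in vector])
```

Because every output is a convex combination of the input frames, each fused vector stays within the range of the original features; in a full model this step would run for every spatial token of the VFM feature map and be followed by the task head.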

Problem

Research questions and friction points this paper is trying to address.

Processing asynchronous event streams for vision tasks in challenging environments

Leveraging pretrained Visual Foundation Models for event-based vision applications

Integrating temporal reasoning with visual representations for cross-modality learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Visual Foundation Models with a temporal context fusion block

Uses Long-Range Temporal Attention for global dependencies

Combines Dual Spatiotemporal Attention with Deep Feature Guidance
