EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation

📅 2026-03-10
🤖 AI Summary
Existing unsupervised depth estimation methods for event cameras suffer from temporal inconsistency and limited accuracy due to the absence of dense ground-truth annotations and the neglect of temporal continuity in event streams. This work addresses these limitations by, for the first time, transferring the spatio-temporal and multi-view geometric priors of the vision foundation model VGGT into the event domain. We propose a triple cross-modal distillation framework comprising Cross-Modal Feature Mixture (CMFM) to fuse RGB and event features, Spatio-Temporal Feature Distillation (STFD) to transfer structural priors, and Temporal Consistency Distillation (TCD) to enhance inter-frame stability. Our method reduces the absolute depth error on EventScape by over 53% (from 2.30 m to 1.06 m at 30 m range) and demonstrates strong zero-shot generalization on the DENSE and MVSEC benchmarks.

📝 Abstract
Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain. We achieve this via a comprehensive tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) bridges the modality gap at the output level by fusing RGB and event features to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT's powerful spatio-temporal representations at the feature level; and (iii) Temporal Consistency Distillation (TCD) enforces cross-frame coherence at the temporal level by aligning inter-frame depth changes. Extensive experiments demonstrate that EventVGGT consistently outperforms existing methods, reducing the absolute mean depth error at 30 m by over 53% on EventScape (from 2.30 m to 1.06 m), while exhibiting robust zero-shot generalization on the unseen DENSE and MVSEC datasets.
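The Temporal Consistency Distillation (TCD) term aligns inter-frame depth changes between the event-based student and the VGGT teacher. The abstract does not give the exact formulation, so the following is only a minimal sketch of one plausible instantiation (an L1 penalty on temporal depth differences); the function name, the use of L1, and the flat-list depth-map representation are all assumptions, not the paper's implementation:

```python
def tcd_loss(student_depths, teacher_depths):
    """Temporal Consistency Distillation, sketched (hypothetical).

    Penalises the gap between the student's and the teacher's
    inter-frame depth changes, so the student inherits the teacher's
    temporal stability instead of matching each frame independently.

    student_depths, teacher_depths: lists of per-frame depth maps,
    each map a flat list of floats of equal length.
    """
    total, count = 0.0, 0
    for t in range(len(student_depths) - 1):
        for s0, s1, r0, r1 in zip(student_depths[t], student_depths[t + 1],
                                  teacher_depths[t], teacher_depths[t + 1]):
            # inter-frame change of each model, then their L1 difference
            total += abs((s1 - s0) - (r1 - r0))
            count += 1
    return total / max(count, 1)
```

Note that the loss is zero whenever the two models agree on how depth evolves between frames, even if their absolute depths differ; matching absolute values is left to the other two distillation terms (CMFM and STFD) in this reading.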
Problem

Research questions and friction points this paper is trying to address.

event-based depth estimation
temporal consistency
cross-modal distillation
vision foundation models
monocular depth estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Distillation
Event-based Depth Estimation
Spatio-Temporal Consistency
Vision Foundation Models
Temporal Coherence