🤖 AI Summary
Image-event joint monocular depth estimation faces two key challenges: (1) the scarcity of cross-modal (image/event/depth) labeled data, leading to supervision deficiency, and (2) intrinsic frequency mismatch between images and event streams, hindering effective feature fusion. This paper proposes a depth-label-free framework featuring Frequency-Decoupled Fusion (FreDFuse) and Parameter-Efficient Self-supervised Transfer (PST), enabling zero-shot generalization to extreme illumination conditions and motion-blurred scenes. Our method comprises three core components: foundation-model-driven latent-space alignment, physics-aware degradation-robust fusion, and lightweight decoder adaptation. Evaluated on MVSEC and DENSE benchmarks, it achieves state-of-the-art performance—reducing Abs.Rel error by 14.0% and 24.9%, respectively. The implementation is publicly available.
📝 Abstract
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face generalizability challenges stemming from two factors: 1) limited annotated image-event-depth datasets, causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose the Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components. The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent-space alignment with image foundation models, mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, the Frequency-Decoupled Fusion module (FreDFuse) explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. Together, these components allow FUSE to serve as a universal image-event encoder that requires only lightweight decoder adaptation for each target dataset. Extensive experiments demonstrate state-of-the-art performance, with 14% and 24.9% improvements in Abs.Rel on the MVSEC and DENSE datasets. The framework also exhibits strong zero-shot adaptability to challenging scenarios, including extreme lighting and motion blur, advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE
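To make the frequency-decoupling idea concrete, here is a minimal NumPy sketch of the general technique the abstract describes: each modality's feature map is split into a low-frequency structural band (via a low-pass filter) and a high-frequency edge band (the residual), and the bands are then recombined across modalities. This is an illustrative assumption, not the paper's actual FreDFuse module: the filter choice (a box filter), the fusion rule (image structure plus weighted event edges), and the function names `lowpass` and `freq_decoupled_fuse` are all hypothetical.

```python
import numpy as np

def lowpass(x: np.ndarray, k: int = 5) -> np.ndarray:
    """Box-filter low-pass over a 2D feature map (keeps structural component).
    A stand-in for whatever low-pass operator the real module uses."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def freq_decoupled_fuse(img_feat: np.ndarray,
                        evt_feat: np.ndarray,
                        alpha: float = 0.5) -> np.ndarray:
    """Hypothetical fusion rule: take low-frequency structure from the image
    branch and high-frequency edges (residual after low-pass) from the event
    branch, weighted by alpha."""
    img_low = lowpass(img_feat)            # structural component from image
    evt_high = evt_feat - lowpass(evt_feat)  # edge component from events
    return img_low + alpha * evt_high
```

For example, fusing a uniform image map with an event map containing a single edge-like spike leaves the flat structure intact everywhere except at the spike, where the event high-frequency band raises the fused response. The real FreDFuse operates on learned multi-channel features with a physics-aware design; this sketch only conveys the band-splitting principle.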