🤖 AI Summary
This work addresses the modality gap in heterogeneous stereo matching between event cameras and frame-based cameras, which often causes domain-specific cues to be marginalized. The authors propose Bi-CMPStereo, a bidirectional cross-modal prompting framework that learns finely aligned stereo representations within a target canonical space. A bidirectional domain projection maps each modality into both the event and frame domains, so that semantic and structural information from event streams and intensity frames can be fused. According to the authors, the resulting model preserves and exploits modality-specific cues and significantly outperforms state-of-the-art methods in both matching accuracy and generalization.
📝 Abstract
Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range that is free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to the marginalization of domain-specific cues that are essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.