🤖 AI Summary
This survey addresses the significant performance degradation that conventional handcrafted methods suffer in cross-modal feature matching—e.g., across RGB, depth, point cloud, LiDAR, medical, and vision-language data—caused by modality heterogeneity. It organizes the field around modality-aware techniques: geometric-aware descriptors for depth images, sparse-dense co-modeling for 3D point clouds, attention-enhanced networks for LiDAR scans, and cross-modal alignment mechanisms for heterogeneous pairs. The coverage is architecture-agnostic, spanning both CNN- and Transformer-based backbones as well as detector-based (e.g., SuperPoint) and detector-free (e.g., LoFTR) paradigms. The authors systematically review and empirically compare state-of-the-art single- and cross-modal matching approaches. Comparative results across diverse heterogeneous modality pairs show substantial gains in robustness, generalization, and matching accuracy for learned methods. Notably, detector-free deep models perform best in cross-modal settings, highlighting their strong adaptability to modality shifts.
📝 Abstract
Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, covering traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, built on detectors like Harris corners and descriptors such as SIFT and ORB, are robust under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning methods, exemplified by the detector-based, CNN-based SuperPoint and the detector-free, transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching toward increasingly diverse data interactions.
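The classical pipeline the abstract describes—detect keypoints, compute descriptors, then match them by nearest-neighbor search with Lowe's ratio test—can be sketched in a few lines. The sketch below uses synthetic ORB-style 256-bit binary descriptors rather than a real detector; the function name and parameters are illustrative, not from the survey.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Brute-force Hamming matching with Lowe's ratio test.

    desc_a, desc_b: (N, 32) uint8 arrays of ORB-style 256-bit binary
    descriptors. Returns a list of (index_a, index_b) match pairs.
    """
    # Hamming distance between every pair: popcount of the XOR'd bytes.
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]      # (Na, Nb, 32)
    dist = np.unpackbits(xor, axis=2).sum(axis=2)      # (Na, Nb)
    matches = []
    for i, row in enumerate(dist):
        order = np.argsort(row)
        best, second = order[0], order[1]
        # Lowe's ratio test: keep the match only if the best candidate
        # is clearly closer than the runner-up (max(..., 1) avoids /0).
        if row[best] < ratio * max(row[second], 1):
            matches.append((i, int(best)))
    return matches

rng = np.random.default_rng(0)
desc_b = rng.integers(0, 256, size=(100, 32), dtype=np.uint8)
# Query set: the first 10 reference descriptors with a few flipped bits,
# simulating the same keypoints seen under a small appearance change.
noise = (rng.random((10, 32)) < 0.02).astype(np.uint8)
desc_a = desc_b[:10] ^ noise
print(match_descriptors(desc_a, desc_b))
```

Because binary descriptors of unrelated keypoints differ in roughly half their 256 bits, the ratio test cleanly separates the near-duplicate true matches from random runners-up; this is the same distinctiveness argument used for SIFT/ORB matching that breaks down under large modality gaps.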