🤖 AI Summary
Real-world RGB-D sensors suffer from physical separation and calibration drift, causing spatial misalignment between depth maps and RGB images and severely limiting the generalizability of existing depth super-resolution (DSR) methods in practice. To address this, we propose MOMNet, the first DSR framework that operates robustly without strict cross-modal alignment. Its core innovation is a multi-order matching mechanism, incorporating zero-order (pixel-wise), first-order (gradient-level), and second-order (curvature-level) geometric priors, integrated with structure-aware detectors and a multi-order aggregation module. This design enables robust cross-modal feature retrieval and selective fusion, using multi-order geometric priors to guide adaptive RGB-to-depth information transfer. Consequently, MOMNet significantly improves reconstruction accuracy under misalignment. Extensive experiments on diverse synthetic and real-world misaligned datasets demonstrate state-of-the-art performance, outperforming prior methods in both reconstruction quality and stability.
📝 Abstract
Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D pairs is hindered by inherent hardware limitations (e.g., physically separate RGB and depth sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation module composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.
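To make the "multi-order" idea concrete, the sketch below computes zero-order (raw values), first-order (gradient magnitude), and second-order (Laplacian, a curvature proxy) descriptors of a 2-D map, and scores cross-modal similarity between two such descriptor stacks. This is a minimal NumPy illustration of the general concept only; the function names, finite-difference operators, and cosine-similarity score are my assumptions, not the paper's actual learned matching modules.

```python
import numpy as np

def multi_order_features(x):
    """Stack zero-, first-, and second-order descriptors of a 2-D map.

    Illustrative sketch only: zero-order = raw values, first-order =
    gradient magnitude (finite differences), second-order = 5-point
    Laplacian as a simple curvature estimate (wrap-around at borders).
    """
    xf = x.astype(float)
    gy, gx = np.gradient(xf)              # d/drow, d/dcol
    grad_mag = np.hypot(gx, gy)           # first-order descriptor
    lap = (np.roll(xf, 1, 0) + np.roll(xf, -1, 0)
           + np.roll(xf, 1, 1) + np.roll(xf, -1, 1) - 4 * xf)
    return np.stack([xf, grad_mag, lap], axis=0)

def match_score(depth_feat, rgb_feat):
    """Cosine similarity between flattened multi-order descriptor stacks."""
    a, b = depth_feat.ravel(), rgb_feat.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

In this toy form, a depth patch would be compared against candidate RGB patches via `match_score`, and the highest-scoring candidate retrieved; the paper's mechanism instead performs this matching in learned feature spaces with aggregation guided by structure detectors.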