🤖 AI Summary
This work addresses the performance degradation of RGB-D direct sparse visual odometry in complex scenarios involving dynamic objects, occlusions, illumination changes, and unreliable depth measurements, where the short-term photometric and geometric consistency assumptions break down. The authors propose Con-DSO, a framework that integrates data-driven modeling of consistency uncertainty into RGB-D direct methods. A network trained on dense photometric and depth-geometric consistency errors between temporally adjacent frames predicts pixel-level uncertainty, which Con-DSO converts into a quality prior that guides support-pixel selection and enables decoupled photometric–geometric weighting during keyframe tracking, continuously attenuating unreliable observations rather than rejecting them outright. This unified approach handles multiple failure modes without relying on external modules or handcrafted rules. Evaluated on five public RGB-D benchmarks, Con-DSO substantially outperforms direct RGB-D VO baselines, reducing absolute trajectory error by over 20% on ICL-NUIM and by 50%–80% on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.
📝 Abstract
Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50%–80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.
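To make the last two steps of the abstract concrete, the following is a minimal sketch of quality-aware support-pixel selection and decoupled photometric-geometric weighting. It is not the authors' implementation: the `Candidate` type, the `quality` mapping 1 / (1 + sigma^2), and the gradient-times-quality ranking score are all assumptions chosen for illustration, since the abstract does not specify these forms. The sketch only shows how per-pixel uncertainty can down-weight an observation continuously instead of gating it with a threshold.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """A candidate support pixel in the host keyframe (hypothetical type)."""
    u: int
    v: int
    grad_mag: float     # image gradient magnitude at the pixel
    sigma_photo: float  # predicted photometric consistency uncertainty (>= 0)
    sigma_geo: float    # predicted depth-geometric consistency uncertainty (>= 0)


def quality(sigma: float) -> float:
    """Map an uncertainty to a weight in (0, 1].

    Assumed form: the weight decays smoothly as uncertainty grows and
    never reaches zero, so no observation is rejected outright.
    """
    return 1.0 / (1.0 + sigma * sigma)


def select_support_pixels(cands: Sequence[Candidate], k: int) -> List[Candidate]:
    """Keep the k candidates with the best gradient-times-quality score."""
    def score(c: Candidate) -> float:
        return c.grad_mag * quality(c.sigma_photo) * quality(c.sigma_geo)

    return sorted(cands, key=score, reverse=True)[:k]


def weighted_cost(selected: Sequence[Candidate],
                  r_photo: Sequence[float],
                  r_geo: Sequence[float]) -> float:
    """Sum squared residuals, weighting each residual type separately.

    The photometric residual uses only the photometric weight and the
    geometric residual only the geometric weight (decoupled weighting).
    """
    total = 0.0
    for c, rp, rg in zip(selected, r_photo, r_geo):
        total += quality(c.sigma_photo) * rp * rp
        total += quality(c.sigma_geo) * rg * rg
    return total


candidates = [
    Candidate(0, 0, grad_mag=10.0, sigma_photo=0.0, sigma_geo=0.0),
    Candidate(1, 0, grad_mag=20.0, sigma_photo=3.0, sigma_geo=3.0),
    Candidate(2, 0, grad_mag=5.0, sigma_photo=0.0, sigma_geo=1.0),
]
support = select_support_pixels(candidates, k=2)
cost = weighted_cost(support, r_photo=[1.0, 2.0], r_geo=[1.0, 2.0])
```

In this toy input the pixel with the strongest gradient is not selected, because both of its predicted uncertainties are high. The third pixel keeps its full photometric weight while its geometric residual is halved, which is the point of weighting the two terms separately. A real tracker would feed these weights into an iteratively reweighted pose solver rather than evaluating a single cost.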