🤖 AI Summary
Existing visual odometry (VO) methods predominantly rely on two-frame tracking, neglecting temporal context across image sequences. This limitation hinders global motion modeling and trajectory reliability estimation, leading to significant performance degradation under occlusion, dynamic objects, and low-texture conditions. To address this, we propose the first long-horizon, arbitrary-point tracking frontend that jointly exploits visual features, inter-trajectory associations, and temporal evolution cues. Our method introduces a temporal probabilistic modeling framework coupled with a learnable iterative optimization module for per-point uncertainty inference. Key components include multi-cue deep tracking, temporal Bayesian distribution updating, differentiable iterative refinement, and a dynamic anchor selection mechanism. Evaluated on mainstream VO benchmarks, our approach consistently outperforms state-of-the-art methods, achieving substantial improvements in localization robustness and accuracy—particularly in challenging occluded, dynamic, and texture-deprived scenarios.
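To make the temporal Bayesian distribution updating concrete, below is a minimal sketch in Python/NumPy. It assumes a simple Gaussian per-point model fused by precision weighting across frames; the function name `bayes_update` and all numbers are illustrative assumptions, not taken from the paper, whose actual distribution parameters are produced by a learned iterative refinement module.

```python
import numpy as np

def bayes_update(mu, var, obs, obs_var):
    """Precision-weighted Gaussian fusion of the prior track estimate
    with a new per-frame observation (illustrative stand-in for the
    paper's learned distribution update)."""
    prec = 1.0 / var + 1.0 / obs_var          # posterior precision
    new_var = 1.0 / prec
    new_mu = new_var * (mu / var + obs / obs_var)
    return new_mu, new_var

# Track a 2D point across frames: each frame contributes a noisy
# observation, and the posterior variance shrinks as evidence
# accumulates, yielding a per-point confidence the back-end can weight.
mu, var = np.zeros(2), 1.0                    # hypothetical prior
for obs, obs_var in [(np.array([0.9, 1.1]), 0.5),
                     (np.array([1.0, 1.0]), 0.25)]:
    mu, var = bayes_update(mu, var, obs, obs_var)
print(mu, var)                                # tighter estimate over time
```

The design intuition is that aggregating evidence over the whole temporal window, rather than a single frame pair, lets the system both sharpen the track estimate and expose a calibrated uncertainty for downstream pose optimization.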
📝 Abstract
Visual odometry estimates the motion of a moving camera based on visual input. Existing methods, mostly focusing on two-view point tracking, often ignore the rich temporal context in the image sequence, thereby overlooking the global motion patterns and providing no assessment of the full trajectory reliability. These shortcomings hinder performance in scenarios with occlusion, dynamic objects, and low-texture areas. To address these challenges, we present the Long-term Effective Any Point Tracking (LEAP) module. LEAP innovatively combines visual, inter-track, and temporal cues with mindfully selected anchors for dynamic track estimation. Moreover, LEAP's temporal probabilistic formulation integrates distribution updates into a learnable iterative refinement module to reason about point-wise uncertainty. Based on these traits, we develop LEAP-VO, a robust visual odometry system adept at handling occlusions and dynamic scenes. Our mindful integration showcases a novel practice by employing long-term point tracking as the front-end. Extensive experiments demonstrate that the proposed pipeline significantly outperforms existing baselines across various visual odometry benchmarks.
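As a rough illustration of how a long-term tracking front-end could gate tracks by per-point uncertainty before pose estimation, consider the sketch below. `select_reliable_tracks`, the array shapes, and the threshold `tau` are all hypothetical; LEAP's actual selection relies on its learned anchors and probabilistic formulation rather than this fixed rule.

```python
import numpy as np

def select_reliable_tracks(tracks, visibility, confidence, tau=0.5):
    """Keep long-term tracks that stay visible across the window and
    whose mean confidence exceeds tau; survivors feed the VO back-end.

    tracks:      (N, T, 2) point trajectories over a T-frame window
    visibility:  (N, T) boolean per-frame visibility flags
    confidence:  (N, T) per-point confidence values in [0, 1]
    """
    visible_all = visibility.all(axis=1)          # never occluded
    confident = confidence.mean(axis=1) > tau     # trusted on average
    keep = visible_all & confident
    return tracks[keep], keep
```

Filtering on trajectory-level confidence like this is what distinguishes a long-term front-end from two-frame matching: unreliable points (occluded, dynamic, or low-texture) can be rejected using evidence from the entire window rather than a single pair of views.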