🤖 AI Summary
This work addresses ambiguity in visual floorplan localization (FLoc) caused by repetitive structures in minimalist floorplans. To resolve it without semantic annotations, the authors propose DisCo-FLoc. A ray regression predictor, built on depth estimation, first generates multiple ray-casting-based pose candidates. A dual-level contrastive learning scheme with position-level and orientation-level constraints then matches depth-aware visual features to the corresponding geometric structure of the floorplan and selects the best candidate. On two standard FLoc benchmarks, the method outperforms the state-of-the-art semantic-based method in both localization accuracy and robustness.
📝 Abstract
Since floorplan data is readily available, long-term persistent, and robust to changes in visual appearance, visual Floorplan Localization (FLoc) has garnered significant attention. Existing methods either ingeniously match geometric priors or utilize sparse semantics to reduce FLoc uncertainty. However, they still suffer from ambiguous FLoc caused by repetitive structures within minimalist floorplans. Moreover, expensive but limited semantic annotations restrict their applicability. To address these issues, we propose DisCo-FLoc, which utilizes dual-level visual-geometric Contrasts to Disambiguate depth-aware visual FLoc, without requiring additional semantic labels. Our solution begins with a ray regression predictor tailored for ray-casting-based FLoc, which predicts a series of FLoc candidates using depth estimation. In addition, a novel contrastive learning method with position-level and orientation-level constraints is proposed to strictly match depth-aware visual features with the corresponding geometric structures in the floorplan. Such matches can effectively eliminate FLoc ambiguity and select the optimal imaging pose from the FLoc candidates. Exhaustive comparative studies on two standard visual FLoc benchmarks demonstrate that our method outperforms the state-of-the-art semantic-based method, achieving significant improvements in both robustness and accuracy.
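To make the disambiguation step more concrete, the following is a minimal sketch of how a contrastive objective and candidate selection of this kind might look. It is not the authors' implementation: the InfoNCE-style loss, the cosine-similarity scoring, the temperature value, and the names `info_nce` and `select_candidate` are all assumptions chosen for illustration, and the embeddings are taken as given rather than produced by any particular network.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def info_nce(visual, geometric, temperature=0.07):
    """InfoNCE-style loss (an assumed stand-in for the paper's contrastive loss).

    visual:    (N, D) image embeddings.
    geometric: (N, D) floorplan-geometry embeddings; row i is the positive
               for visual row i, and all other rows act as negatives.
    """
    v = l2_normalize(visual)
    g = l2_normalize(geometric)
    logits = v @ g.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def select_candidate(visual, candidate_geometric):
    """Pick the pose candidate whose geometric embedding best matches the image.

    visual:              (D,) embedding of the query image.
    candidate_geometric: (K, D) embeddings, one per pose candidate.
    Returns the index of the best candidate and all K similarity scores.
    """
    scores = l2_normalize(candidate_geometric) @ l2_normalize(visual)
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    geo = rng.normal(size=(4, 8))
    print("loss, matched pairs:   ", info_nce(geo, geo))
    print("loss, mismatched pairs:", info_nce(geo, geo[::-1]))
    best, scores = select_candidate(np.array([0.1, 0.9, 0.0]), np.eye(3))
    print("selected candidate:", best)
```

In this sketch, the position-level and orientation-level constraints described in the abstract would correspond to how negatives are chosen (for example, other positions versus other headings at the same position); that pairing strategy is not specified here and is left out of the code.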