🤖 AI Summary
Current visual place recognition methods perform poorly in cluttered indoor environments, primarily because they neglect object-oriented structural information. To address this, we propose an object-aware coarse-to-fine room re-identification framework that systematically integrates multi-granularity object cues: global contextual features, instance segmentation masks, keypoint-guided local regions, and object-level representations. Our method leverages vision transformers for multi-scale feature extraction and incorporates hierarchical feature fusion with ranking optimization, so individual components can be substituted without degrading overall performance. Evaluated on four newly constructed datasets (MPReID, HMReID, GibsonReID, and ReplicaReID), the approach outperforms state-of-the-art methods on nearly all evaluation metrics, with improvements ranging from 6% to 80%. It also remains robust and consistent under diverse viewpoint variations.
📝 Abstract
Room reidentification (ReID) is a challenging yet essential task with numerous applications in fields such as augmented reality (AR) and homecare robotics. Existing visual place recognition (VPR) methods, which typically rely on global descriptors or aggregate local features, often struggle in cluttered indoor environments densely populated with man-made objects. These methods tend to overlook the crucial role of object-oriented information. To address this, we propose AirRoom, an object-aware pipeline that integrates multi-level object-oriented information (from global context to object patches, object segmentation, and keypoints), utilizing a coarse-to-fine retrieval approach. Extensive experiments on four newly constructed datasets (MPReID, HMReID, GibsonReID, and ReplicaReID) demonstrate that AirRoom outperforms state-of-the-art (SOTA) models across nearly all evaluation metrics, with improvements ranging from 6% to 80%. Moreover, AirRoom exhibits significant flexibility, allowing various modules within the pipeline to be substituted with different alternatives without compromising overall performance. It also shows robust and consistent performance under diverse viewpoint variations.
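To illustrate the general idea of coarse-to-fine retrieval mentioned above, here is a minimal sketch in plain Python. It is not the AirRoom implementation: the descriptor vectors, the `object_score` matching rule, the `top_k` parameter, and the toy room database are all assumptions made for illustration. The coarse stage shortlists rooms by similarity of a global descriptor, and the fine stage re-ranks only that shortlist using object-level descriptors.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def object_score(query_objs, ref_objs):
    """Hypothetical fine score: mean best-match similarity over the query's objects."""
    if not query_objs or not ref_objs:
        return 0.0
    best = [max(cosine(q, r) for r in ref_objs) for q in query_objs]
    return sum(best) / len(best)

def coarse_to_fine(query, database, top_k=2):
    """Shortlist rooms by global descriptor, then re-rank the shortlist by object score."""
    # Coarse stage: keep the top_k rooms with the most similar global descriptor.
    shortlist = sorted(
        database,
        key=lambda room: cosine(query["global"], room["global"]),
        reverse=True,
    )[:top_k]
    # Fine stage: re-rank only the shortlisted rooms using object-level descriptors.
    reranked = sorted(
        shortlist,
        key=lambda room: object_score(query["objects"], room["objects"]),
        reverse=True,
    )
    return [room["id"] for room in reranked]

# Toy data (assumed): 2-D global descriptors and 3-D object descriptors.
database = [
    {"id": "kitchen", "global": [1.0, 0.0], "objects": [[1.0, 0.0, 0.0]]},
    {"id": "office", "global": [0.9, 0.1], "objects": [[0.0, 1.0, 0.0]]},
    {"id": "bedroom", "global": [0.0, 1.0], "objects": [[0.0, 1.0, 0.0]]},
]
query = {"global": [1.0, 0.05], "objects": [[0.0, 0.9, 0.1]]}

result = coarse_to_fine(query, database, top_k=2)
print(result)
```

In this toy setup the global descriptor alone would rank the kitchen first, while the object-level re-ranking moves the office ahead of it. The bedroom never reaches the fine stage because the coarse filter drops it, which is the trade-off any shortlist size implies.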