🤖 AI Summary
In vision-based goal navigation, agents often exhibit unproductive wandering in cross-room scenarios due to visual ambiguity, mistaking visually similar but semantically distinct rooms. To address this, we propose a "Room Expert" module that learns implicit room-style representations via unsupervised pretraining, enabling binary judgments of whether an observed image and the target image belong to the same room. Building on this, we design two room-relation-aware navigation fusion strategies that emulate human-inspired, stage-wise reasoning. Our approach integrates unsupervised representation learning, multimodal feature alignment, and reinforcement learning–based navigation policies. Evaluated on three mainstream benchmarks, our method consistently outperforms prior state-of-the-art approaches, achieving substantial improvements in cross-room navigation success (+12.3%) and path efficiency (18.7% reduction in path length). To our knowledge, this is the first work to systematically resolve navigation failures induced by cross-room visual ambiguity.
📝 Abstract
Image-goal navigation aims to steer an agent towards the goal location specified by an image. Most prior methods tackle this task by learning a navigation policy, which extracts visual features of the goal and observation images, compares their similarity, and predicts actions. However, if the agent is in a different room from the goal image, it is extremely challenging to identify their similarity and infer the likely goal location, which may result in the agent wandering around. Intuitively, when humans carry out this task, they may roughly compare the current observation with the goal image, forming an approximate sense of whether they are in the same room before executing actions. Inspired by this intuition, we imitate human behaviour and propose a Room Expert Guided Image-Goal Navigation model (REGNav) to equip the agent with the ability to analyze whether the goal and observation images are taken in the same room. Specifically, we first pre-train a room expert with an unsupervised learning technique on self-collected unlabelled room images. The expert extracts the hidden room-style information of the goal and observation images and predicts whether they belong to the same room. In addition, two different fusion approaches are explored to efficiently guide the agent's navigation with the room-relation knowledge. Extensive experiments show that our REGNav surpasses prior state-of-the-art works on three popular benchmarks.
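To make the idea concrete, the same-room prediction described above can be sketched as a shared image encoder feeding a binary pairwise head. This is a minimal illustrative sketch, not the paper's actual architecture: all layer shapes, the encoder design, and the name `RoomExpert` are placeholder assumptions.

```python
import torch
import torch.nn as nn

class RoomExpert(nn.Module):
    """Illustrative same-room classifier: a shared encoder produces
    room-style features for each image, and a pairwise head predicts
    whether the two images come from the same room. All sizes here
    are placeholder assumptions, not the paper's architecture."""

    def __init__(self, feat_dim=128):
        super().__init__()
        # Shared encoder (stand-in for the unsupervised pretrained
        # room-style feature extractor described in the abstract).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Binary head over the concatenated observation/goal features.
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs, goal):
        z_obs = self.encoder(obs)
        z_goal = self.encoder(goal)
        logit = self.head(torch.cat([z_obs, z_goal], dim=-1))
        return torch.sigmoid(logit)  # probability of "same room"

# Usage: the resulting probability could then condition the navigation
# policy, e.g. as an extra input feature or a gating signal (the paper's
# two fusion strategies; the exact mechanisms are not shown here).
expert = RoomExpert()
obs = torch.randn(4, 3, 64, 64)   # batch of observation images
goal = torch.randn(4, 3, 64, 64)  # batch of goal images
p_same = expert(obs, goal)        # shape (4, 1), values in (0, 1)
```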