SONAR: Semantic-Object Navigation with Aggregated Reasoning through a Cross-Modal Inference Paradigm

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weak semantic cues in vision-language navigation (VLN) under unknown environments lead to poor generalization and low robustness. To address this, we propose a cross-modal reasoning framework that integrates multi-scale semantic maps with a vision-language model (VLM). Methodologically, we design a confidence-map-guided localization strategy that jointly optimizes a semantic-map-driven target prediction module and a VLM-generated value map module, thereby balancing generalization capability and scene adaptability. Additionally, we incorporate multi-scale feature fusion and confidence estimation to enhance decision reliability under weak semantic conditions. Evaluated on the Matterport3D dataset, our approach achieves a success rate of 38.4% and an SPL (Success-weighted by Path Length) of 17.7%, significantly outperforming existing methods. These results demonstrate the effectiveness and robustness of our framework in complex, unseen environments.
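The summary above describes jointly using a semantic-map-driven target prediction and a VLM-generated value map, weighted by a confidence map, to pick navigation goals. A minimal sketch of how such a fusion could work follows; the grid shapes, the per-cell confidence weighting, and the blend weight `alpha` are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def fuse_value_maps(semantic_map, vlm_value_map, confidence_map, alpha=0.5):
    """Blend a semantic-map target prediction with a VLM value map.

    All inputs are (H, W) grids with values in [0, 1]. `alpha` and the
    confidence weighting are illustrative choices, not the paper's values.
    """
    # Weight the semantic prediction by its per-cell confidence so that
    # low-confidence detections defer to the VLM's scene-level prior.
    weighted_semantic = confidence_map * semantic_map
    return alpha * weighted_semantic + (1 - alpha) * vlm_value_map

def select_goal(fused_map):
    """Pick the grid cell with the highest fused value as the next goal."""
    return np.unravel_index(np.argmax(fused_map), fused_map.shape)
```

Under this scheme, a confidently detected target dominates goal selection, while in weak-semantic regions the VLM prior takes over, which mirrors the generalization/adaptability trade-off the summary describes.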

📝 Abstract
Understanding human instructions and accomplishing Vision-Language Navigation tasks in unknown environments is essential for robots. However, existing modular approaches rely heavily on the quality of training data and often exhibit poor generalization. Vision-Language Model (VLM) based methods, while demonstrating strong generalization capabilities, tend to perform unsatisfactorily when semantic cues are weak. To address these issues, this paper proposes SONAR, an aggregated reasoning approach built on a cross-modal paradigm. The proposed method integrates a semantic-map-based target prediction module with a VLM-based value map module, enabling more robust navigation in unknown environments with varying levels of semantic cues and effectively balancing generalization ability with scene adaptability. For target localization, we propose a strategy that integrates multi-scale semantic maps with confidence maps to mitigate false detections of target objects. We evaluated SONAR in the Gazebo simulator, using the challenging Matterport 3D (MP3D) dataset as the experimental benchmark. Experimental results demonstrate that SONAR achieves a success rate of 38.4% and an SPL of 17.7%.
Problem

Research questions and friction points this paper is trying to address.

Enhancing robot navigation in unknown environments with weak semantic cues
Integrating semantic maps with vision-language models for robust navigation
Mitigating false target detections through multi-scale semantic confidence maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates semantic map with vision-language model modules
Uses multi-scale semantic maps and confidence maps
Enables robust navigation in varying semantic conditions
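The multi-scale confidence idea in the bullets above can be sketched as follows: a detection is trusted only if it remains supported when the semantic map is pooled to coarser scales, which suppresses isolated false positives. The pooling factors and the product-of-supports rule here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def downsample_mean(grid, factor):
    """Average-pool an (H, W) grid by an integer factor (H, W divisible by it)."""
    h, w = grid.shape
    return grid.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def multiscale_confidence(detections, factors=(1, 2, 4)):
    """Confidence = product of detection support across pooling scales.

    A spurious single-cell detection has little support at coarse scales,
    so its confidence is suppressed; a spatially extended true object
    keeps high support everywhere. Scales and the product rule are
    illustrative choices for this sketch.
    """
    conf = np.ones_like(detections)
    for f in factors:
        coarse = downsample_mean(detections, f)
        # Upsample back to full resolution by nearest-neighbour repetition.
        support = np.repeat(np.repeat(coarse, f, axis=0), f, axis=1)
        conf *= support
    return conf
```

For example, a 4x4 block of detections keeps confidence 1.0 at every scale, while a single isolated cell is averaged away in its coarse-scale neighbourhood and ends up with near-zero confidence.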
Yao Wang
Shenzhen Key Laboratory of Robotics Perception and Intelligence, Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China.
Zhirui Sun
Southern University of Science and Technology
Robot Perception · Path Planning
Wenzheng Chi
Soochow University
Human-friendly Robot Navigation · Human Robot Motion Interaction
Baozhi Jia
Xiamen Key Laboratory of Visual Perception Technology and Application, Xiamen, China.
Wenjun Xu
Peng Cheng Laboratory
machine learning · reinforcement learning · flexible/soft robot
Jiankun Wang
Southern University of Science and Technology
Robotics · Path Planning · Motion Control · Human-Robot Interaction