🤖 AI Summary
To address the homogenization of navigation behavior that arises from ambiguous social norms in human-robot coexistence scenarios, this paper proposes a paradigm of multi-action socially compliant navigation. The method introduces: (1) a dual-annotated, multi-environment, multi-action navigation dataset; (2) a meta-cognitive prompt (MCP) method to enhance the social reasoning capabilities of vision-language models; and (3) an integrated framework combining multi-turn dialogue modeling with multi-dimensional evaluation metrics, including Action Preference Gain (APG) and Ethical Robustness (ER). The dataset contains 789 samples, of which 79 are held out for testing. Compared with zero-shot GPT-4o and Claude, the approach achieves an APG of 0.595 (vs. 0.000/0.025) and an ER of 0.264 (vs. 0.642/0.668), while running at 1.524 FPS (over 3× faster). The framework generates multiple socially acceptable navigation actions within a single scenario, advancing robust, norm-aware robotic navigation.
📝 Abstract
Socially compliant navigation requires robots to move safely and appropriately in human-centered environments by respecting social norms. However, social norms are often ambiguous, and in a single scenario, multiple actions may be equally acceptable. Most existing methods simplify this problem by assuming a single correct action, which limits their ability to handle real-world social uncertainty. In this work, we propose MAction-SocialNav, an efficient vision-language model for socially compliant navigation that explicitly addresses action ambiguity by generating multiple plausible actions within one scenario. To enhance the model's reasoning capability, we introduce a novel meta-cognitive prompt (MCP) method. Furthermore, to evaluate the proposed method, we curate a multi-action socially compliant navigation dataset that accounts for diverse conditions, including crowd density, indoor and outdoor environments, and dual human annotations. The dataset contains 789 samples, each with a three-turn conversation, randomly split into 710 training samples and 79 test samples. We also design five evaluation metrics to assess high-level decision precision, safety, and diversity. Extensive experiments demonstrate that the proposed MAction-SocialNav achieves strong social reasoning performance while maintaining high efficiency, highlighting its potential for real-world human-robot navigation. Compared with zero-shot GPT-4o and Claude, our model achieves substantially higher decision quality (APG: 0.595 vs. 0.000/0.025) and safety alignment (ER: 0.264 vs. 0.642/0.668), while maintaining real-time efficiency (1.524 FPS, over 3x faster).
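The random 710/79 split described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_dataset`, the use of integer sample IDs, and the fixed seed are all assumptions, since the abstract states only that the split was made through random selection.

```python
import random

def split_dataset(sample_ids, n_test=79, seed=0):
    """Randomly partition sample IDs into (train, test) lists.

    The seed and the ID scheme are hypothetical; the source gives neither.
    """
    ids = list(sample_ids)
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(ids)
    # First n_test shuffled IDs form the test set, the rest the training set.
    return ids[n_test:], ids[:n_test]

# 789 samples in total, as reported in the abstract.
train_ids, test_ids = split_dataset(range(789))
print(len(train_ids), len(test_ids))  # 710 79
```

Fixing the seed keeps the held-out set stable across runs, which matters when comparing models on so small a test set.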