NavThinker: Action-Conditioned World Models for Coupled Prediction and Planning in Social Navigation

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the coupled prediction-planning challenge of robot navigation in dynamic human environments with the NavThinker framework. NavThinker uses an action-conditioned world model that autoregressively predicts future scene geometry and pedestrian trajectories in the feature space of Depth Anything V2, with a multi-head decoder producing the joint predictions. Combined with online reinforcement learning via DD-PPO and shaped social rewards, the framework enables proactive social navigation. Notably, this is the first approach to synergistically combine action-conditioned world models with reinforcement learning for social navigation: it attains state-of-the-art navigation success rates on Social-HM3D, transfers zero-shot to Social-MP3D, and validates its generalization and practicality through deployment on a Unitree Go2 quadruped robot.

📝 Abstract
Social navigation requires robots to act safely in dynamic human environments. Effective behavior demands thinking ahead: reasoning about how the scene and pedestrians evolve under different robot actions rather than reacting to current observations alone. This creates a coupled prediction-planning challenge, where robot actions and human motion mutually influence each other. To address this challenge, we propose NavThinker, a future-aware framework that couples an action-conditioned world model with on-policy reinforcement learning. The world model operates in the Depth Anything V2 patch feature space and performs autoregressive prediction of future scene geometry and human motion; multi-head decoders then produce future depth maps and human trajectories, yielding a future-aware state aligned with traversability and interaction risk. Crucially, we train the policy with DD-PPO while injecting world-model think-ahead signals via: (i) action-conditioned future features fused into the current observation embedding and (ii) social reward shaping from predicted human trajectories. Experiments on single- and multi-robot Social-HM3D show state-of-the-art navigation success, with zero-shot transfer to Social-MP3D and real-world deployment on a Unitree Go2, validating generalization and practical applicability. Webpage: https://github.com/hutslib/NavThinker.
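The abstract's think-ahead loop can be illustrated with a toy sketch: roll an action-conditioned latent forward autoregressively, decode predicted pedestrian positions, and shape a social reward from predicted proximity. Everything here (the linear transition, the decoder, the dimensions, `safe_dist`) is an illustrative stand-in, not the authors' implementation.

```python
import numpy as np

def rollout_world_model(z0, actions, transition):
    """Autoregressively roll a latent state forward under candidate robot
    actions, mimicking an action-conditioned world model."""
    z, traj = z0, []
    for a in actions:
        z = transition(z, a)   # predict next latent from (state, action)
        traj.append(z)
    return traj

def decode_pedestrians(z):
    """Stand-in for a multi-head decoder: map the latent to 2D pedestrian
    positions (here just the first four dims reshaped)."""
    return z[:4].reshape(2, 2)  # two pedestrians, (x, y)

def social_reward(robot_xy, ped_xy, safe_dist=1.0):
    """Shaped social reward: penalize predicted proximity to pedestrians."""
    d = np.linalg.norm(ped_xy - robot_xy, axis=1).min()
    return 0.0 if d >= safe_dist else -(safe_dist - d)

# Toy linear transition standing in for the learned world model.
A = np.eye(8) * 0.95
B = np.ones((8, 2)) * 0.1
transition = lambda z, a: A @ z + B @ a

z0 = np.zeros(8)
actions = [np.array([1.0, 0.0])] * 3      # e.g. "move forward" for 3 steps
future = rollout_world_model(z0, actions, transition)
peds = decode_pedestrians(future[-1])
r = social_reward(np.array([0.0, 0.0]), peds)
print(round(r, 3))                        # negative: predicted pedestrians too close
```

The key property the sketch preserves is that the reward depends on *predicted* future pedestrian positions under the robot's own candidate actions, which is what couples prediction with planning.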
Problem

Research questions and friction points this paper is trying to address.

social navigation
coupled prediction-planning
action-conditioned world models
human-robot interaction
future-aware reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

action-conditioned world model
coupled prediction-planning
future-aware navigation
social reward shaping
Depth Anything V2
Authors

Tianshuai Hu
Ph.D. student at HKUST
Robotics, Autonomous Driving

Zeying Gong
The Hong Kong University of Science and Technology (Guangzhou)
Forecasting, Embodied AI

Lingdong Kong
National University of Singapore
Computer Vision, Deep Learning

XiaoDong Mei
The Hong Kong University of Science and Technology

Yiyi Ding
The Hong Kong University of Science and Technology (Guangzhou)

Qi Zeng
The Hong Kong University of Science and Technology (Guangzhou)

Ao Liang
University of Chinese Academy of Sciences

Rong Li
Ph.D. student, HKUST (GZ)
Computer Vision, Embodied AI

Yangyi Zhong
The Hong Kong University of Science and Technology (Guangzhou)

Junwei Liang
Assistant Professor, HKUST (Guangzhou) | CSE, HKUST | Ph.D. @CMU
Computer Vision, Robotics, Embodied AI, Trajectory Prediction