🤖 AI Summary
To address insufficient safety and efficiency of robot navigation in densely interactive environments (e.g., corridors, furniture-cluttered spaces), this paper proposes a heterogeneous spatiotemporal graph modeling framework that explicitly captures dynamic couplings among humans, robots, and obstacles. Methodologically, we design a Graph Transformer architecture integrating multi-head attention, recurrent temporal modeling, and Proximal Policy Optimization (PPO)-based reinforcement learning, augmented with multi-modal perception from LiDAR and RGB-D sensors. Our key contributions are the first formulation of a heterogeneous spatiotemporal graph representation and its end-to-end learnable optimization, enabling zero-shot generalization across varying scene densities. Experiments demonstrate that our approach achieves significantly higher navigation success rates and path efficiency than state-of-the-art methods in both simulation and real-world robotic platforms. Moreover, it improves zero-shot transfer performance by 32% on average and reduces collision rates by 41%.
📝 Abstract
We study the problem of robot navigation in dense and interactive crowds with environmental constraints such as corridors and furniture. Previous methods fail to consider all types of interactions among agents and obstacles, leading to unsafe and inefficient robot paths. In this article, we leverage a graph-based representation of crowded and constrained scenarios and propose a structured framework to learn robot navigation policies with deep reinforcement learning. We first split the representations of different components in the environment and propose a heterogeneous spatio-temporal (st) graph to model distinct interactions among humans, robots, and obstacles. Based on the heterogeneous st-graph, we propose HEIGHT, a novel navigation policy network architecture with different components to capture heterogeneous interactions among entities through space and time. HEIGHT utilizes attention mechanisms to prioritize important interactions and a recurrent network to track changes in the dynamic scene over time, encouraging the robot to avoid collisions adaptively. Through extensive simulation and real-world experiments, we demonstrate that HEIGHT outperforms state-of-the-art baselines in terms of success and efficiency in challenging navigation scenarios. Furthermore, we demonstrate that our pipeline achieves better zero-shot generalization capability than previous works when the densities of humans and obstacles change. More videos are available at https://sites.google.com/view/crowdnav-height/home.