🤖 AI Summary
Existing trajectory prediction models neglect implicit visual cues in pedestrian motion, particularly the social and behavioral information encoded in human pose. To address this, we propose Social-Pose, the first systematic pose encoder that integrates both 2D and 3D human pose representations into trajectory forecasting frameworks. Using attention mechanisms, it explicitly models inter-agent interactions and motion intent. Social-Pose is modular and compatible with diverse backbone architectures (LSTM, GAN, MLP, and Transformer), enabling plug-and-play integration. Evaluated on Joint Track Auto, Human3.6M, Pedestrians and Cyclists in Road Traffic, and JRDB, it improves prediction accuracy on all of them, reducing average displacement error (ADE) by 12.7%–23.4%. The encoder is robust to pose estimation noise, generalizes well across scenarios, and is validated in real-world robot navigation tasks.
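For readers unfamiliar with the metric, ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps. The snippet below is a minimal sketch of that standard definition and of a relative reduction. The function names and the toy numbers are illustrative assumptions, not values or code from the paper.

```python
import numpy as np


def average_displacement_error(pred, gt):
    """Mean Euclidean distance between predicted and true positions.

    pred, gt: array-likes of shape (timesteps, 2) or (agents, timesteps, 2).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Distance per position, then average over every time step (and agent).
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def relative_reduction(baseline_ade, new_ade):
    """Fractional ADE reduction of a new model relative to a baseline."""
    return (baseline_ade - new_ade) / baseline_ade


# Toy trajectory (hypothetical values): per-step errors are 0, 1, and 2.
gt = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
pred = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
ade = average_displacement_error(pred, gt)

# Hypothetical baseline of 1.0 versus a pose-aware model at 0.8.
reduction = relative_reduction(1.0, 0.8)
```

Under this definition, a reported reduction such as 12.7% means the pose-aware model's ADE is 12.7% lower than that of the same backbone without pose input.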
📝 Abstract
Accurate human trajectory prediction is one of the most crucial tasks for autonomous driving, as it underpins safety. Yet existing models often fail to fully leverage the visual cues that humans subconsciously communicate when navigating a space. In this work, we study the benefits of predicting human trajectories from human body poses rather than only from their Cartesian locations over time. We propose 'Social-pose', an attention-based pose encoder that captures the poses of all humans in a scene and their social relations. Our method can be integrated into various trajectory prediction architectures. We conducted extensive experiments on state-of-the-art models (based on LSTM, GAN, MLP, and Transformer) and show improvements over all of them on synthetic (Joint Track Auto) and real (Human3.6M, Pedestrians and Cyclists in Road Traffic, and JRDB) datasets. We also explore the advantages of using 2D versus 3D poses, the effect of noisy poses, and the application of our pose-based predictor in robot navigation scenarios.
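The idea of an attention-based encoder over the poses of all humans in a scene can be illustrated with a minimal sketch. This is not the authors' implementation: the single-head scaled dot-product attention, the function name `pose_attention`, the 17-joint 2D skeleton, the embedding size, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pose_attention(poses, w_embed, w_q, w_k, w_v):
    """Single-head self-attention across the agents in one frame.

    poses: (num_agents, num_joints * dims) flattened keypoints.
    Returns per-agent features and the agent-to-agent attention weights.
    """
    tokens = poses @ w_embed                      # one token per agent
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (num_agents, num_agents)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ v, weights


# Hypothetical setup: 3 agents, 17 joints in 2D, 16-dim embedding.
rng = np.random.default_rng(0)
num_agents, num_joints, dims, d_model = 3, 17, 2, 16
poses = rng.normal(size=(num_agents, num_joints * dims))
w_embed = rng.normal(size=(num_joints * dims, d_model)) * 0.1
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

features, weights = pose_attention(poses, w_embed, w_q, w_k, w_v)
```

In a sketch like this, each row of `weights` indicates how strongly one agent's pose token attends to every agent in the scene, and `features` could be concatenated with location embeddings before being passed to an LSTM, GAN, MLP, or Transformer backbone. Swapping 2D for 3D poses only changes `dims`.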