🤖 AI Summary
Existing approaches struggle to address pose and trajectory prediction jointly and efficiently in human motion forecasting, and combined models often lag behind specialized ones on individual benchmarks. This work proposes a concise, end-to-end Transformer architecture that stacks self-attention modules to model both the spatial dependencies within body poses and the temporal dynamics of motion sequences. A single, task-agnostic model handles pose prediction, trajectory forecasting, and their joint estimation without task-specific modifications, and achieves state-of-the-art performance across multiple benchmarks, including Human3.6M, AMASS, ETH-UCY, and 3DPW, with improved cross-task generalization and prediction accuracy.
📝 Abstract
Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been developed. Combining these models for holistic human motion prediction is non-trivial, and recent methods have struggled to compete on established benchmarks for individual tasks. To address this, we propose a simple yet effective transformer-based model for human motion prediction. The model employs a stack of self-attention modules to effectively capture both spatial dependencies within a pose and temporal relationships across a motion sequence. This simple, streamlined, end-to-end model is sufficiently versatile to handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications. We demonstrate that this approach achieves state-of-the-art results across all tasks through extensive experiments on a wide range of benchmark datasets, including Human3.6M, AMASS, ETH-UCY, and 3DPW.
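The abstract's core design, a stack of self-attention modules that capture spatial dependencies within a pose and temporal relationships across the sequence, can be sketched roughly as follows. This is a hypothetical illustration, not the authors' code: it uses identity Q/K/V projections instead of learned weights and omits residual connections, normalization, and feed-forward layers, but shows how one block can attend over joints within each frame and then over frames for each joint.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of feature vectors.
    Simplification: queries, keys, and values are the inputs themselves
    (real layers apply learned Q/K/V projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

def spatio_temporal_block(motion):
    """motion: nested list of shape [frames][joints][features].
    Spatial attention mixes information across joints within each frame;
    temporal attention then mixes information across frames per joint."""
    # Spatial pass: attend across joints, one frame at a time.
    spatial = [self_attention(frame) for frame in motion]
    # Temporal pass: attend across frames, one joint at a time.
    n_joints = len(spatial[0])
    out = [[None] * n_joints for _ in spatial]
    for j in range(n_joints):
        track = [frame[j] for frame in spatial]
        for t, vec in enumerate(self_attention(track)):
            out[t][j] = vec
    return out

# Toy input: 4 frames, 3 joints, 2 features per joint.
motion = [[[float(t + j), float(t - j)] for j in range(3)] for t in range(4)]
pred = spatio_temporal_block(motion)
```

Stacking several such blocks, as the paper's model does with full Transformer layers, lets the network model both intra-pose structure and motion dynamics in one architecture; pose-only, trajectory-only, or combined inputs differ only in what the per-frame tokens represent.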