🤖 AI Summary
Dense crowd trajectory prediction faces dual challenges: complex pairwise spatiotemporal interactions and heterogeneous group dynamics. To address these, we propose a group-aware multi-scale modeling framework. First, we construct a multi-scale hypergraph to explicitly encode social group associations at varying granularities. Second, we integrate hypergraph spectral convolution—based on random-walk transition probabilities—with a spatiotemporal Transformer to achieve heterogeneous alignment between pairwise individual interactions and collective group coordination. Third, we introduce a multimodal Transformer fusion network to enhance joint intention-trajectory reasoning. Our method achieves significant improvements over state-of-the-art approaches on five mainstream pedestrian datasets, demonstrating both effectiveness and generalizability of multi-scale group-structure modeling for intent recognition and trajectory forecasting in dense scenarios.
📝 Abstract
Predicting crowd intents and trajectories is crucial in various real-world applications, including service robots and autonomous vehicles. Understanding environmental dynamics is challenging, not only because of the complexity of modeling pair-wise spatial and temporal interactions but also because of the diverse influence of group-wise interactions. To decode the comprehensive pair-wise and group-wise interactions in crowded scenarios, we introduce Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. In Hyper-STTN, crowd group-wise correlations are constructed using a set of multi-scale hypergraphs with varying group sizes and captured through random-walk probability-based hypergraph spectral convolution. Additionally, a spatial-temporal transformer is adapted to capture pedestrians' pair-wise latent interactions in spatial-temporal dimensions. These heterogeneous group-wise and pair-wise interactions are then fused and aligned through a multimodal transformer network. Hyper-STTN outperforms other state-of-the-art baselines and ablation models on five real-world pedestrian motion datasets.
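To make the group-wise branch more concrete, the following is a minimal NumPy sketch of multi-scale hypergraphs combined with a random-walk hypergraph convolution. It is not the authors' implementation. The nearest-neighbour grouping rule, the function names (`knn_incidence`, `random_walk_transition`, `multi_scale_hypergraph_conv`), the ReLU activation, and the concatenation of scales are all assumptions made for illustration. The transition matrix follows the standard random-walk form for hypergraphs, $P = D_v^{-1} H W D_e^{-1} H^\top$, where $H$ is the incidence matrix, $W$ the hyperedge weights, and $D_v$, $D_e$ the vertex and hyperedge degree matrices.

```python
import numpy as np


def knn_incidence(positions, group_size):
    """Incidence matrix H (n_nodes x n_edges) for one group scale.

    Assumption: one hyperedge per pedestrian, containing that pedestrian
    and its (group_size - 1) nearest neighbours by Euclidean distance.
    """
    n = positions.shape[0]
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    H = np.zeros((n, n))
    for e in range(n):
        members = np.argsort(dists[e])[:group_size]
        H[members, e] = 1.0
    return H


def random_walk_transition(H, edge_weights=None):
    """Row-stochastic matrix P = D_v^{-1} H W D_e^{-1} H^T."""
    n_edges = H.shape[1]
    w = np.ones(n_edges) if edge_weights is None else edge_weights
    d_v = H @ w            # weighted vertex degrees
    d_e = H.sum(axis=0)    # hyperedge degrees (members per hyperedge)
    # Scale each hyperedge column by w / d_e, walk back to the vertices,
    # then normalise each row by its vertex degree.
    return ((H * w / d_e) @ H.T) / d_v[:, None]


def multi_scale_hypergraph_conv(positions, X, thetas, group_sizes=(2, 3, 4)):
    """Apply one random-walk hypergraph convolution per group scale.

    X:      (n_nodes, in_dim) node features.
    thetas: one (in_dim, out_dim) weight matrix per scale (hypothetical
            stand-ins for learned parameters).
    Returns the per-scale outputs concatenated along the feature axis.
    """
    outputs = []
    for k, theta in zip(group_sizes, thetas):
        P = random_walk_transition(knn_incidence(positions, k))
        outputs.append(np.maximum(P @ X @ theta, 0.0))  # ReLU
    return np.concatenate(outputs, axis=-1)
```

Each row of `P` sums to one, so a pedestrian's updated feature is a weighted average over everyone sharing a hyperedge with it. Varying `group_sizes` changes how far that averaging reaches, which is one simple way to read "multi-scale" here.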