🤖 AI Summary
To address feature degradation in scene context encoding for motion prediction in autonomous driving, this paper proposes HAMF, a unified learning framework that jointly models scene understanding and future motion representation. The method combines attention mechanisms with the Mamba state-space model: historical trajectories, high-definition maps, and learnable future motion tokens are tokenized into 1D sequences; an attention-based encoder, built from self-attention and cross-attention modules, fuses these inputs into joint contextual representations; and a Mamba-based decoder generates diverse, multimodal trajectory predictions. Evaluated on the Argoverse 2 benchmark, the approach achieves state-of-the-art performance, with balanced results in prediction accuracy (minADE, minFDE, and miss rate, MR) and model efficiency (parameter count and inference latency).
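For readers unfamiliar with the metrics named above, the following is a minimal sketch of how minADE, minFDE, and MR are commonly defined for multimodal forecasts. It is not the benchmark's official evaluation code: the function names, the 2D point representation, and the 2.0 m miss threshold are assumptions made for illustration, and the official implementation may differ in details such as how the best mode is selected.

```python
import math

def displacement(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def min_ade(modes, gt):
    """Smallest average displacement error over all predicted modes."""
    return min(
        sum(displacement(p, q) for p, q in zip(traj, gt)) / len(gt)
        for traj in modes
    )

def min_fde(modes, gt):
    """Smallest final-point displacement error over all predicted modes."""
    return min(displacement(traj[-1], gt[-1]) for traj in modes)

def miss_rate(scenarios, threshold=2.0):
    """Fraction of scenarios whose best final-point error exceeds the threshold.

    `scenarios` is a list of (modes, ground_truth) pairs; the 2.0 m default
    is an assumed threshold for this sketch.
    """
    misses = sum(1 for modes, gt in scenarios if min_fde(modes, gt) > threshold)
    return misses / len(scenarios)
```

Lower values are better for all three quantities. Because each metric takes the minimum over modes, a model is rewarded when at least one of its predicted trajectories lands close to the ground truth.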
📝 Abstract
Motion forecasting is a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. Existing approaches predict future motion states from scene context features extracted from historical agent trajectories and road layouts, but they suffer from information degradation during scene feature encoding. To address this limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with the scene context encoding, coherently combining scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features, which are represented as a set of learnable tokens. We then design a unified attention-based encoder that combines self-attention and cross-attention mechanisms to jointly model the scene context and aggregate future motion features. Complementing the encoder, we use a Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations and to generate accurate and diverse final trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
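The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the absence of learned projections, the order of the attention steps, and the token sizes are all assumptions. In particular, `linear_scan` is a fixed-parameter linear recurrence standing in for the Mamba module, whose actual scan uses input-dependent (selective) parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def add(xs, ys):
    """Element-wise residual sum of two token sequences."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def encode(scene_tokens, future_tokens):
    """Assumed encoder step: self-attention over the scene tokens, then
    cross-attention from the future motion tokens to the scene."""
    scene = add(scene_tokens, attend(scene_tokens, scene_tokens, scene_tokens))
    future = add(future_tokens, attend(future_tokens, scene, scene))
    return scene, future

def linear_scan(tokens, decay=0.5):
    """Simplified stand-in for the Mamba decoder: a sequential state update
    h_t = decay * h_{t-1} + (1 - decay) * x_t across the future tokens."""
    state = [0.0] * len(tokens[0])
    outputs = []
    for tok in tokens:
        state = [decay * s + (1 - decay) * x for s, x in zip(state, tok)]
        outputs.append(state)
    return outputs

# Toy 1D token sequences: three scene tokens (agents and map) and
# two future motion tokens, one per predicted mode.
scene_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
future_tokens = [[0.5, 0.0], [0.0, 0.5]]
scene, future = encode(scene_tokens, future_tokens)
decoded = linear_scan(future)
```

The sequential scan makes each mode's decoded feature depend on the modes before it, which is one way to picture how a state-space decoder could keep the future motion representations correlated rather than decoding each mode in isolation.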