HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address feature degradation in scene context encoding for motion prediction in autonomous driving, this paper proposes a unified learning framework that jointly models scene understanding and future motion representation. Methodologically, it integrates attention mechanisms with the Mamba state-space model: historical trajectories, high-definition maps, and learnable future motion tokens are uniformly tokenized into 1D sequences; a hybrid encoder combining self-attention and cross-attention modules fuses these inputs into joint contextual representations; and a Mamba-based decoder generates diverse, multimodal trajectory predictions. Evaluated on the Argoverse 2 benchmark, the approach achieves state-of-the-art performance, with balanced gains in prediction accuracy (minADE/minFDE), multimodal coverage (miss rate, MR), and model efficiency (parameter count and inference latency).
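As a concrete illustration of the hybrid encoder described above, here is a minimal PyTorch sketch in which learnable future motion tokens attend to the tokenized scene while self-attention refines the scene context. Module names, layer counts, and dimensions (HybridAttentionEncoder, d_model, n_modes, etc.) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class HybridAttentionEncoder(nn.Module):
    """Sketch of a joint scene-context / future-motion encoder (assumed design)."""
    def __init__(self, d_model=128, n_heads=8, n_layers=3, n_modes=6):
        super().__init__()
        # Learnable queries, one per predicted future motion mode.
        self.future_tokens = nn.Parameter(torch.randn(n_modes, d_model))
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, scene_tokens):
        # scene_tokens: (B, N, d_model), agent histories and map elements
        # already embedded as a single 1D token sequence.
        B = scene_tokens.size(0)
        queries = self.future_tokens.unsqueeze(0).expand(B, -1, -1)
        for sa, ca in zip(self.self_attn, self.cross_attn):
            # Self-attention models interactions within the scene context.
            scene_tokens = scene_tokens + sa(scene_tokens, scene_tokens,
                                             scene_tokens, need_weights=False)[0]
            # Cross-attention lets the future motion tokens aggregate scene
            # features, coupling context encoding with motion learning.
            queries = queries + ca(queries, scene_tokens, scene_tokens,
                                   need_weights=False)[0]
        return self.norm(queries)  # (B, n_modes, d_model)
```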

📝 Abstract
Motion forecasting is a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. While existing approaches predict future motion states from scene context features extracted from historical agent trajectories and road layouts, they suffer from information degradation during scene feature encoding. To address this limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with scene context encoding, coherently combining scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. We then design a unified attention-based encoder that synergistically combines self-attention and cross-attention mechanisms to jointly model scene context information and aggregate future motion features. Complementing the encoder, we employ a Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations and to generate accurate and diverse final trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
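For the decoding stage, a minimal sketch might look like the following, where a simplified gated state-space recurrence stands in for the actual Mamba block: scanning across the mode tokens lets each mode's state depend on the others, which is one way to preserve consistency among the learned representations. The class name SSMDecoder, the diagonal transition, and the output heads (a 60-step (x, y) trajectory plus a confidence score per mode) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SSMDecoder(nn.Module):
    """Simplified stand-in for a Mamba-style decoding block (assumed design)."""
    def __init__(self, d_model=128, d_state=16, horizon=60):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        # Toy diagonal SSM: negative init keeps exp(A) in (0, 1) for stability.
        self.A = nn.Parameter(-torch.rand(d_state))
        self.B = nn.Linear(d_model, d_state)
        self.C = nn.Linear(d_state, d_model)
        self.traj_head = nn.Linear(d_model, horizon * 2)  # (x, y) per future step
        self.score_head = nn.Linear(d_model, 1)           # per-mode confidence

    def forward(self, mode_tokens):
        # mode_tokens: (B, K, d_model), one token per future motion mode.
        B, K, _ = mode_tokens.shape
        x = self.in_proj(mode_tokens)
        state = x.new_zeros(B, self.A.numel())
        outs = []
        for k in range(K):
            # Recurrent scan over modes keeps their representations correlated.
            state = torch.exp(self.A) * state + self.B(x[:, k])
            outs.append(self.C(state))
        h = torch.stack(outs, dim=1) * torch.sigmoid(self.gate(mode_tokens))
        trajs = self.traj_head(h).view(B, K, -1, 2)  # (B, K, horizon, 2)
        scores = self.score_head(h).squeeze(-1)      # (B, K)
        return trajs, scores
```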
Problem

Research questions and friction points this paper is trying to address.

Predicting surrounding agents' future trajectories in autonomous driving systems
Combining scene context understanding with future motion prediction in a single framework
Mitigating information degradation during scene feature encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Attention-Mamba framework for motion forecasting
Unified Attention encoder for joint context modeling
Mamba module preserves motion representation consistency (a combined usage sketch follows this list)
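Putting the two sketches together, a hypothetical forward pass (with arbitrary batch size, token count, and dimensions) would be:

```python
import torch

encoder = HybridAttentionEncoder(d_model=128, n_modes=6)
decoder = SSMDecoder(d_model=128, horizon=60)

scene_tokens = torch.randn(4, 96, 128)  # 4 scenes, 96 agent/map tokens each
mode_tokens = encoder(scene_tokens)     # (4, 6, 128) joint mode representations
trajs, scores = decoder(mode_tokens)    # (4, 6, 60, 2) trajectories, (4, 6) scores
```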
Xiaodong Mei
Hong Kong University of Science and Technology, Hong Kong SAR, China
Sheng Wang
Hong Kong University of Science and Technology, Hong Kong SAR, China
Jie Cheng
Institute of Automation, Chinese Academy of Sciences
Reinforcement Learning
Yingbing Chen
HKUST, IIP, PhD
Motion planning, robotics, machine learning technologies.
Dan Xu
Hong Kong University of Science and Technology, Hong Kong SAR, China