HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address feature degradation in scene context encoding for motion prediction in autonomous driving, this paper proposes a unified learning framework that jointly models scene understanding and future motion representation. The method combines attention mechanisms with the Mamba state-space model: historical trajectories, high-definition maps, and learnable future motion tokens are tokenized into 1D sequences; an attention-based encoder, combining self-attention and cross-attention modules, fuses these inputs into joint contextual representations; and a Mamba-based decoder generates diverse, multimodal trajectory predictions. Evaluated on the Argoverse 2 benchmark, the approach is reported to achieve state-of-the-art performance on the standard accuracy metrics (minADE, minFDE, and miss rate, MR) with a simple and lightweight architecture.
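For readers unfamiliar with the metrics named above, the following is a minimal sketch of how minADE, minFDE, and a per-scenario miss flag are commonly computed for K candidate trajectories. The function name, the input layout, and the 2.0 m miss threshold are assumptions made for illustration; they are not taken from the paper, and benchmark toolkits may differ in details such as which candidate is used for minADE.

```python
import math

def min_ade_fde_miss(preds, gt, miss_threshold=2.0):
    """Hypothetical helper, not from the paper.

    preds: list of K candidate trajectories, each a list of (x, y) points.
    gt:    ground-truth trajectory, a list of (x, y) points of the same length.
    Returns (minADE, minFDE, missed) for one scenario.
    """
    ades, fdes = [], []
    for traj in preds:
        # Per-timestep Euclidean displacement between prediction and ground truth.
        dists = [math.dist(p, g) for p, g in zip(traj, gt)]
        ades.append(sum(dists) / len(dists))  # average displacement error
        fdes.append(dists[-1])                # final displacement error
    min_fde = min(fdes)
    # Assumption: a scenario counts as a miss when even the best endpoint
    # is farther than the threshold from the ground-truth endpoint.
    return min(ades), min_fde, min_fde > miss_threshold

gt = [(0, 0), (1, 0), (2, 0)]
preds = [
    [(0, 1), (1, 1), (2, 1)],  # constant 1 m lateral offset
    [(0, 0), (1, 0), (2, 3)],  # accurate early, 3 m off at the endpoint
]
result = min_ade_fde_miss(preds, gt)
```

The miss rate (MR) over a dataset is then the fraction of scenarios for which the third returned value is true.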

📝 Abstract

Motion forecasting is a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. Existing approaches predict future motion states from scene context features extracted from historical agent trajectories and road layouts, but they suffer from information degradation during scene feature encoding. To address this limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with the scene context encoding, coherently combining scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. We then design a unified attention-based encoder, which combines self-attention and cross-attention mechanisms to model the scene context information and aggregate future motion features jointly. Complementing the encoder, we implement a Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations and to generate accurate and diverse final trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
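The idea of letting learnable future motion tokens aggregate scene context can be illustrated with a toy example. This is a rough sketch only: it uses single-head scaled dot-product attention with identity projections (no learned weights), two-dimensional hand-picked embeddings, and one self-attention pass followed by one cross-attention pass. All names, values, and the ordering of the two passes are assumptions for illustration, not the paper's architecture.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Scaled dot-product attention with identity projections (illustrative only)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        # Each output is a weighted average of the key/value tokens.
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(d)])
    return out

agent_tokens = [[1.0, 0.0], [0.5, 0.5]]   # hypothetical agent-history embeddings
map_tokens = [[0.0, 1.0]]                 # hypothetical lane embedding
future_tokens = [[1.0, 0.0], [0.0, 1.0]]  # one token per motion mode (learned in practice)

scene = agent_tokens + map_tokens         # agents and map as one 1D token sequence
scene = attend(scene, scene)              # self-attention over the scene context
future = attend(future_tokens, scene)     # cross-attention: mode tokens query the scene
```

After the cross-attention pass, each entry of `future` is a different weighted mixture of the scene tokens, which is the sense in which distinct mode tokens can pick up distinct parts of the context.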

Problem

Research questions and friction points this paper is trying to address.

Predicts future trajectories in autonomous driving systems

Combines scene context understanding with motion prediction

Addresses information degradation in scene feature encoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Attention-Mamba framework for motion forecasting

Unified Attention encoder for joint context modeling

Mamba module preserves motion representation consistency
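As background on the Mamba component, the sketch below shows the basic linear state-space recurrence that such models build on, in scalar form with fixed parameters. It is a simplification under stated assumptions: Mamba itself makes the parameters input-dependent (selective) and operates on vector states with an efficient scan, and the function name and parameter values here are hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Scalar linear state-space recurrence (illustrative, fixed parameters):

        h_t = a * h_{t-1} + b * x_t
        y_t = c * h_t
    """
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # the hidden state carries information along the sequence
        ys.append(c * h)    # readout at each position
    return ys

# An impulse at the first position decays geometrically through the state.
ys = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
```

Because each output depends on every earlier input through the hidden state, a scan of this kind relates the elements of a sequence to one another at a cost linear in sequence length.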
