AI Summary
Multimodal trajectory prediction faces two key challenges: (1) heavy reliance on high-definition (HD) maps incurs high deployment costs and degrades robustness; (2) map-free approaches lack global scene context, causing pairwise attention mechanisms to overfit straight-line motion patterns and impairing the modeling of transitional behaviors and motion-intent alignment. This paper proposes a map-free, global-context-aware hybrid attention model. Its core contributions are: (1) scene-level intent prior modeling coupled with hierarchical interaction reasoning; (2) scaled additive aggregation and dual-path cross-attention to decouple mode suppression from enhanced representation of transitional dynamics; and (3) an encoder-decoder architecture integrating mode embeddings, neighbor-context enhancement, and gated fusion. Evaluated on the TOD-VT highway-ramp dataset, the method significantly improves prediction accuracy in high-curvature and transitional regions while demonstrating strong robustness and modular extensibility.
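The "scaled additive aggregation" of contribution (2) can be illustrated with a minimal sketch. The paper's exact formulation is not given here, so this assumes sigmoid-bounded per-token weights and a 1/N scale; the function name and shapes are hypothetical. The key contrast with softmax pooling is that each token's weight is bounded independently, so one dominant mode cannot push the other modes' weights toward zero (the inter-mode suppression the paper targets):

```python
import numpy as np

def bounded_scaled_additive_aggregation(tokens, scores):
    """Aggregate mode-embedded trajectory tokens into one global context
    vector. Each token gets an independent sigmoid-bounded weight in (0, 1),
    and the weighted sum is scaled by 1/N, so no mode competes with another
    for normalized attention mass (unlike softmax pooling).

    tokens: (N, D) mode-embedded trajectory tokens
    scores: (N,)   unnormalized relevance scores
    """
    weights = 1.0 / (1.0 + np.exp(-scores))                # bounded, no competition
    return (weights[:, None] * tokens).sum(axis=0) / len(tokens)

# toy usage: 4 mode tokens with 8-dim embeddings
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
scores = rng.normal(size=4)
g = bounded_scaled_additive_aggregation(tokens, scores)
print(g.shape)  # (8,)
```

With all scores equal to zero every weight is 0.5, so the aggregate reduces to half the mean token, which makes the bounded, non-competitive behavior easy to verify.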
Abstract
Multimodal trajectory prediction generates multiple plausible future trajectories to address the uncertainty of vehicle motion arising from intention ambiguity and execution variability. However, HD-map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, which can cause prediction failures. Map-free approaches lack global context: pairwise attention over-amplifies straight-line patterns while suppressing transitional ones, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation that achieves intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds a scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under the shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs, while a neighbor-context-enhanced pathway emphasizes salient interactions, with a gating module mediating their contributions to maintain a coverage-focus balance. Experiments on eight highway-ramp scenarios from the TOD-VT dataset show that GContextFormer outperforms state-of-the-art baselines. Spatial error distributions show that, compared to existing transformer models, GContextFormer is more robust, with improvements concentrated in high-curvature and transition zones. Interpretability is provided through motion-mode distinctions and neighbor-context modulation, which expose the model's reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
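The decoder's gated fusion of the two pathways can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: it assumes the gate is a sigmoid over the concatenated pathway outputs, and all names, shapes, and the gate parameterization are hypothetical. The gate yields a per-pair convex combination, so the fused output stays between the coverage-oriented and focus-oriented pathways:

```python
import numpy as np

def gated_dual_path_fusion(standard_out, enhanced_out, w_gate, b_gate):
    """Fuse two decoder pathway outputs per agent-mode pair: a standard
    cross-attention output (uniform geometric coverage) and a
    neighbor-context-enhanced output (salient interactions). A sigmoid
    gate computed from both pathways mediates their contributions.

    standard_out, enhanced_out: (M, D) pathway outputs
    w_gate: (2*D,) gate weights; b_gate: scalar bias (learned in practice)
    """
    both = np.concatenate([standard_out, enhanced_out], axis=-1)   # (M, 2D)
    gate = 1.0 / (1.0 + np.exp(-(both @ w_gate + b_gate)))         # (M,) in (0, 1)
    # per-pair convex combination: gate -> 1 favors the enhanced pathway
    return gate[:, None] * enhanced_out + (1.0 - gate[:, None]) * standard_out

# toy usage: 6 agent-mode pairs with 8-dim features
rng = np.random.default_rng(1)
a = rng.normal(size=(6, 8))   # standard pathway output
b = rng.normal(size=(6, 8))   # neighbor-context-enhanced pathway output
fused = gated_dual_path_fusion(a, b, rng.normal(size=16) * 0.1, 0.0)
print(fused.shape)  # (6, 8)
```

Because the gate value is shared across a pair's feature dimensions, each fused feature is elementwise bounded by the two pathway outputs, which is one simple way to realize the coverage-focus balance the abstract describes.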