🤖 AI Summary
This work addresses the long-standing trade-off between modeling capacity and computational efficiency in trajectory prediction. Existing approaches struggle to capture long-range dependencies and local dynamics simultaneously: attention mechanisms suffer from high complexity, while recurrent models often fail to balance the two. To overcome this, we propose FoSS, a dual-branch framework that integrates Fourier spectral decomposition with linear state space models (SSMs). The frequency-domain branch efficiently models global intent and local variations, while the time-domain branch preserves long-range context with linear complexity; cross-attention fuses features from both domains to generate multimodal trajectories. The proposed Coarse2Fine-SSM and SpecEvolve-SSM modules enable O(N) spectral refinement, complemented by learnable queries and a weighted fusion head that represent motion uncertainty. The method achieves state-of-the-art performance on Argoverse 1 and 2 with 22.5% less computation and over 40% fewer parameters.
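The Fourier spectral decomposition at the core of the frequency-domain branch can be sketched with a plain real FFT. This is a hypothetical NumPy illustration, not the paper's code; `spectral_decompose` and `spectral_reconstruct` are names introduced here. It shows the amplitude/phase split the summary refers to: amplitude describes the overall shape of the motion, phase the timing of each frequency component, and the split is lossless.

```python
import numpy as np

def spectral_decompose(traj):
    """traj: (T, 2) array of xy positions -> (amplitude, phase), each (T//2+1, 2)."""
    spec = np.fft.rfft(traj, axis=0)        # per-coordinate DFT along the time axis
    return np.abs(spec), np.angle(spec)     # amplitude spectrum, phase spectrum

def spectral_reconstruct(amplitude, phase, length):
    """Recombine amplitude and phase and invert the real FFT."""
    spec = amplitude * np.exp(1j * phase)
    return np.fft.irfft(spec, n=length, axis=0)

# Toy trajectory: steady forward motion plus a small lateral oscillation.
T = 30                                       # e.g. a 3 s history at 10 Hz (assumed)
t = np.linspace(0.0, 1.0, T)
traj = np.stack([t, 0.1 * np.sin(4 * np.pi * t)], axis=1)

amp, pha = spectral_decompose(traj)          # amp.shape == pha.shape == (16, 2)
recon = spectral_reconstruct(amp, pha, T)
assert np.allclose(traj, recon, atol=1e-9)   # the decomposition is invertible
```

Any refinement applied to `amp` and `pha` separately (as the Coarse2Fine-SSM and SpecEvolve-SSM modules are described as doing) can be mapped back to a trajectory through the inverse transform.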
📝 Abstract
Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity as the number of agents grows, while recurrent models fail to capture both long-range dependencies and fine-grained local dynamics. To address this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch applies a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reproduces self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on the Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the contribution of each component.
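The O(N) claim for the SSM branches follows from the recurrent form of a linear state-space model. The scalar sketch below is a simplified illustration in the spirit of the S4/Mamba family, not FoSS's actual selective SSM (which uses input-dependent parameters and vector states): the recurrence h_t = a·h_{t-1} + b·x_t, y_t = c·h_t makes one pass over the sequence, versus the O(N²) pairwise scores of self-attention.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """One O(N) pass of a scalar linear SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    x: (N,) input sequence; a, b, c: scalar SSM parameters (assumed fixed here,
    whereas a *selective* SSM would compute them from the input at each step).
    """
    h = 0.0
    y = np.empty_like(x)
    for t in range(len(x)):   # single linear sweep over the sequence
        h = a * h + b * x[t]
        y[t] = c * h
    return y

x = np.ones(8)
y = ssm_scan(x, a=0.5, b=1.0, c=1.0)
# Constant input with a = 0.5 yields the geometric partial sums 2 - 0.5**t,
# so the state (and output) converges toward 2.
```

Because each step touches only the previous state, cost and memory grow linearly in sequence length, which is what lets the time-domain branch retain long-range context without attention's quadratic blow-up.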