🤖 AI Summary
Existing approaches to multimodal motion prediction suffer from mode collapse and unreliable confidence ranking because each training scenario provides only a single realized future trajectory. To address these limitations, this work proposes the Mode-as-Sequence framework, which recasts the unordered set of predicted modes as an ordered sequence and explicitly models mode-to-mode dependencies, through either recurrent decoding (ModeSeq) or single-pass parallel decoding (Parallel ModeSeq). The framework further introduces an Early-Match-Take-All (EMTA) loss and a lightweight ranking regularizer to learn representative modes and calibrated confidence under sparse labels. On large-scale benchmarks, the method consistently improves best-of-K accuracy and ranking-oriented metrics. In the Waymo Open Dataset challenges, ModeSeq placed first in the 2024 LiDAR-free motion prediction track and Parallel ModeSeq placed first in the 2025 Interaction Prediction Challenge.
📝 Abstract
Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-$K$ accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.
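The masked mode-to-mode self-attention mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the layer design, so the single-head scaled dot-product form, the absence of learned projections, and the function names `causal_mode_mask` and `masked_mode_self_attention` are all assumptions made for illustration. The sketch only shows the causal property itself: mode i attends to modes 0..i, so every mode can be computed in one pass while keeping the left-to-right dependency that recurrent decoding would impose.

```python
import math


def causal_mode_mask(num_modes):
    """Hypothetical helper: mask[i][j] is True when mode i may attend to mode j (j <= i)."""
    return [[j <= i for j in range(num_modes)] for i in range(num_modes)]


def masked_mode_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over K mode embeddings.

    queries, keys, values: lists of K vectors (lists of floats), one per mode.
    Assumption: no learned projections or multiple heads; a real decoder
    layer would add those. Only the causal masking is being illustrated.
    """
    num_modes = len(queries)
    dim = len(queries[0])
    mask = causal_mode_mask(num_modes)
    outputs = []
    for i in range(num_modes):
        # Mode i only sees itself and the modes that precede it in the sequence.
        allowed = [j for j in range(num_modes) if mask[i][j]]
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed positions.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum((w / total) * values[j][d] for w, j in zip(weights, allowed))
            for d in range(len(values[0]))
        ])
    return outputs


# Three toy mode embeddings (K = 3, dimension 2).
modes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = masked_mode_self_attention(modes, modes, modes)

# Changing the last mode must leave the outputs of earlier modes untouched.
perturbed = [row[:] for row in modes]
perturbed[2] = [5.0, -3.0]
attended_perturbed = masked_mode_self_attention(perturbed, perturbed, perturbed)
```

In this toy setup the outputs for the first two modes are identical before and after the last mode is perturbed, which is the dependency structure the abstract describes as being preserved from recurrent decoding. The loop over `i` is written out for readability; in a tensor framework the same mask would be applied to the full K-by-K score matrix in a single batched operation.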