🤖 AI Summary
Existing motion prediction methods struggle to achieve high accuracy, interpretability, and trajectory diversity at the same time, and their latent queries are prone to collapse. To address this, the authors propose an end-to-end differentiable framework that builds an interpretable “motion bank” via contrastive learning and introduces an Anchor Retrieval Layer to dynamically retrieve explicit motion priors from it. The architecture combines a dual-level gated cross-attention mechanism, a Straight-Through Gumbel-Softmax estimator for discrete anchor selection, and a DETR-style decoder for geometric refinement, enabling multimodal trajectory prediction while preserving gradient flow throughout training. On the Argoverse 2 and Waymo Open Motion datasets, the model reports competitive multimodal accuracy while conditioning its predictions on diverse, interpretable motion primitives.
📝 Abstract
Motion forecasting often requires trading interpretability for predictive accuracy. Standard anchor-based architectures rely on opaque latent queries that are highly prone to latent collapse, or naive trajectory sampling that limits multi-modal diversity. We propose an end-to-end differentiable framework that grounds predictions in a comprehensive "motion bank", a structured embedding space of physically realizable trajectories constructed via contrastive learning. Rather than regressing paths from a blank slate, our architecture dynamically retrieves explicit motion priors using a novel Anchor Retrieval Layer. This module adapts orthogonally initialized queries via a Dual-Level Gated Cross-Attention mechanism and executes discrete trajectory selection using a Straight-Through Gumbel-Softmax estimator to preserve continuous gradient flow. The retrieved semantically grounded anchors are then geometrically refined by a DETR-style decoder, optimized jointly with a Winner-Takes-All (WTA) kinematic Gaussian Mixture Model (GMM), a latent diversity penalty, and a soft-min weighted endpoint loss. By strictly conditioning the decoding phase on diverse, interpretable motion primitives, our approach eliminates the "black box" of standard latent queries while achieving competitive multi-modal accuracy on the Argoverse 2 and Waymo Open Motion datasets. Code is available at: https://github.com/abviv/recall2predict
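To make the discrete selection step more concrete, the following is a minimal sketch of Straight-Through Gumbel-Softmax retrieval from a toy motion bank. It is not taken from the authors' repository: the function names (`gumbel_softmax_st`, `retrieve`, `soft_grad_wrt_logits`), the three-entry bank, the logits, the temperature, and the squared-error loss are all hypothetical choices made for illustration. The forward pass returns one bank entry exactly (a hard one-hot choice), while the backward pass uses the gradient of the relaxed softmax sample as a surrogate.

```python
import math
import random


def softmax(scores, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_st(logits, tau, rng):
    """Return (hard one-hot, soft relaxed sample) for a single query."""
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    noisy = [
        x - math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))
        for x in logits
    ]
    soft = softmax(noisy, tau)
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft


def retrieve(weights, bank):
    """Weighted sum of the (flattened) bank trajectories."""
    dim = len(bank[0])
    return [sum(w * traj[d] for w, traj in zip(weights, bank)) for d in range(dim)]


def soft_grad_wrt_logits(soft, upstream, tau):
    """Straight-through backward surrogate.

    upstream[i] is d(loss)/d(weight_i), evaluated at the hard forward output.
    The gradient is routed through the softmax Jacobian
    dy_i/dx_j = y_i * (delta_ij - y_j) / tau.
    """
    dot = sum(u * y for u, y in zip(upstream, soft))
    return [y * (u - dot) / tau for y, u in zip(soft, upstream)]


# Hypothetical toy bank: 3 trajectories, each 2 waypoints flattened as (x0, y0, x1, y1).
bank = [
    [1.0, 0.0, 2.0, 0.0],   # go straight
    [1.0, 0.5, 1.5, 1.5],   # turn left
    [1.0, -0.5, 1.5, -1.5], # turn right
]
logits = [2.0, 0.5, 0.1]     # assumed query-to-bank similarity scores
target = [1.0, 0.4, 1.6, 1.4]
tau = 0.5
rng = random.Random(0)

hard, soft = gumbel_softmax_st(logits, tau, rng)
anchor = retrieve(hard, bank)  # forward pass: exactly one bank entry

# Squared-error loss against the target; gradient w.r.t. the selection weights.
d_anchor = [2.0 * (a - t) for a, t in zip(anchor, target)]
upstream = [sum(g * v for g, v in zip(d_anchor, traj)) for traj in bank]
grad_logits = soft_grad_wrt_logits(soft, upstream, tau)
```

In an autograd framework the same effect is usually obtained by returning `hard + soft - stop_gradient(soft)`, so the explicit Jacobian above is only needed because this sketch avoids external dependencies. Lower temperatures make the soft sample closer to one-hot but increase gradient variance.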