🤖 AI Summary
Trajectory prediction models for automated driving lack standardized evaluation frameworks, particularly for heterogeneous traffic scenarios, multi-agent joint prediction, and robustness analysis. Method: This paper introduces STEP, an integrated training and evaluation platform that establishes a unified benchmarking framework with interfaces to multiple datasets, consistent training/evaluation protocols, and explicit support for modeling complex agent interactions. Contribution/Results: Experiments with STEP systematically uncover deficiencies overlooked by conventional evaluation: insufficient modeling of multi-agent dynamic coupling, sensitivity to distribution shift, and vulnerability to adversarial perturbations by other agents. The results expose fundamental limitations of state-of-the-art models in interaction-aware prediction and out-of-distribution generalization. This work shifts the evaluation paradigm from static leaderboard ranking toward deeper behavioral insight and mechanistic analysis, supporting more trustworthy assessment of autonomous driving prediction models.
📝 Abstract
While trajectory prediction plays a critical role in enabling safe and effective path planning in automated vehicles, standardized practices for evaluating such models remain underdeveloped. Recent efforts have aimed to unify dataset formats and model interfaces for easier comparison, yet existing frameworks often fall short in supporting heterogeneous traffic scenarios, joint prediction models, or user documentation. In this work, we introduce STEP, a new benchmarking framework that addresses these limitations by providing a unified interface to multiple datasets, enforcing consistent training and evaluation conditions, and supporting a wide range of prediction models. We demonstrate the capabilities of STEP in a series of experiments that reveal 1) the limitations of widely used testing procedures, 2) the importance of jointly modeling agents for better prediction of interactions, and 3) the vulnerability of current state-of-the-art models to both distribution shifts and targeted attacks by adversarial agents. With STEP, we aim to shift the focus from the "leaderboard" approach toward deeper insights about model behavior and generalization in complex multi-agent settings.