🤖 AI Summary
This study systematically evaluates the out-of-distribution generalization of three mainstream trajectory prediction architectures—graph neural networks (GNNs), Transformers, and CNN-based models—under cross-dataset transfer between Argoverse 2 and Waymo Open Motion, investigating how architectural inductive bias, training data scale, and data augmentation jointly affect robustness. Method: a multi-model comparative framework combining cross-domain transfer evaluation with fine-grained error attribution analysis. Contribution/Results: Counterintuitively, compact models with strong inductive bias achieve superior cross-domain generalization under limited training data, and scaling up training data can degrade transfer performance to a smaller target domain, challenging prevailing benchmarking practices. In the A2→WO setting, the smallest model reduces mean ADE by 12.3%; in WO→A2, all models degrade substantially, yet the highest-bias model retains an 8.7% ADE advantage over the next-best model. These findings underscore the critical role of inductive bias—and caution against data-scale-centric optimization—in safety-critical motion forecasting.
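The headline metric above is ADE. For reference, a minimal sketch of the standard definition: Average Displacement Error is the mean Euclidean distance between predicted and ground-truth positions over a trajectory's timesteps (the function name and array layout here are illustrative, not taken from the paper's code).

```python
import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """ADE for one agent.

    pred, gt: arrays of shape (T, 2) holding (x, y) positions per timestep.
    Returns the mean L2 distance between prediction and ground truth.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Example: a prediction offset by (3 m, 4 m) at every timestep
pred = np.array([[3.0, 4.0], [4.0, 5.0]])
gt = np.array([[0.0, 0.0], [1.0, 1.0]])
print(ade(pred, gt))  # → 5.0
```

A relative ADE comparison of this kind is what the 12.3% and 8.7% figures above refer to.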
📝 Abstract
We study the Out-of-Distribution (OoD) generalization ability of three state-of-the-art trajectory prediction models with comparable In-Distribution (ID) performance but different model designs. We investigate the influence of inductive bias, training data size, and data augmentation strategy by training the models on Argoverse 2 (A2) and testing on Waymo Open Motion (WO), and vice versa. We find that the smallest model, which has the highest inductive bias, exhibits the best OoD generalization across augmentation strategies when trained on the smaller A2 dataset and tested on the large WO dataset. In the converse setting—training all models on the larger WO dataset and testing on the smaller A2 dataset—all models generalize poorly, although the model with the highest inductive bias still generalizes best. We discuss possible reasons for this surprising finding and draw conclusions for the design and evaluation of trajectory prediction models and benchmarks.