AI Summary
This work challenges the prevailing assumption that higher frame rates universally benefit end-to-end (E2E) autonomous driving trajectory prediction, demonstrating instead that excessive temporal resolution can introduce redundancy and noise, which is particularly detrimental to capacity-constrained models. For the first time, temporal sampling frequency is treated as an explicit training variable. The authors construct multi-frame-rate training sets via temporal subsampling and systematically evaluate the impact of varying frequencies on diverse E2E architectures under a fixed protocol across the Waymo, nuScenes, and PAVE datasets. Results reveal frequency responses that depend on both model and dataset: smaller models often show non-monotonic or near-plateau trends and achieve their best 3-second average displacement error (ADE) at low or intermediate frame rates, whereas the larger AutoVLA model performs best at the highest evaluated frame rate on all three datasets. These findings underscore the need to tune sampling frequency jointly with model capacity and dataset characteristics.
Abstract
End-to-end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, assuming that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training-set design variable. Starting from high-frequency E2E driving datasets, we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory. For each model-dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity-aware perspective. Sparse sampling may miss driving-relevant cues, while dense sampling may add redundant visual content and off-manifold noise. For finite-capacity models, this can create a driving-irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA-style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model- and dataset-dependent frequency responses. Smaller E2E models often show non-monotonic or near-plateau trends and achieve their best 3-second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3-second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration-matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.
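The abstract's pipeline (subsample frames, evaluate ADE and FDE, control for update counts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' code: the function names `subsample_indices`, `frequency_sweep`, `ade_fde`, and `epochs_for_matched_updates` are hypothetical, subsampling is assumed to use a fixed integer stride, and iteration matching is assumed to mean training more epochs on the sparser set until the number of optimizer updates is equal.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) waypoint in metres


def subsample_indices(num_frames: int, source_hz: float, target_hz: float) -> List[int]:
    """Frame indices kept when reducing source_hz to target_hz with a fixed stride.

    Assumption: non-integer ratios are rounded to the nearest stride, so the
    effective rate can differ slightly from target_hz.
    """
    if not 0 < target_hz <= source_hz:
        raise ValueError("target_hz must be in (0, source_hz]")
    stride = round(source_hz / target_hz)
    return list(range(0, num_frames, stride))


def frequency_sweep(frames: Sequence, source_hz: float,
                    target_hzs: Sequence[float]) -> Dict[float, list]:
    """Build one subsampled copy of a trajectory's frames per target frequency."""
    return {
        hz: [frames[i] for i in subsample_indices(len(frames), source_hz, hz)]
        for hz in target_hzs
    }


def ade_fde(pred: Sequence[Point], gt: Sequence[Point]) -> Tuple[float, float]:
    """ADE is the mean L2 error over the horizon; FDE is the L2 error at the last step."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]


def epochs_for_matched_updates(target_updates: int, num_samples: int,
                               batch_size: int) -> int:
    """Epochs needed on a (smaller) training set to reach target_updates optimizer steps."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return math.ceil(target_updates / steps_per_epoch)


if __name__ == "__main__":
    frames = list(range(10))  # stand-in for 1 s of camera frames at 10 Hz
    sweep = frequency_sweep(frames, source_hz=10, target_hzs=[10, 5, 2])
    for hz, kept in sweep.items():
        print(hz, kept)
    print(ade_fde([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]))
```

In this sketch a lower target frequency yields fewer training frames per trajectory, which is why a control such as `epochs_for_matched_updates` is needed before attributing a performance difference to sampling frequency itself instead of to the number of gradient updates.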