AI Summary
To address the limited accuracy and weak interpretability of short-horizon trajectory prediction in autonomous driving, this paper proposes KEPT, a knowledge-enhanced trajectory prediction framework. Methodologically, KEPT integrates vision-language models with domain-specific driving knowledge to construct a spatio-temporal-frequency joint video encoder; introduces retrieval-augmented (k-means + HNSW) and chain-of-thought prompting mechanisms that incorporate interpretable kinematic and geometric planning constraints; and employs a three-stage fine-tuning strategy to progressively align spatial, motion, and temporal dynamics. On the nuScenes dataset, KEPT achieves an average L2 error of 0.70 m and a collision rate of 0.21% under the NoAvg protocol, improving to 0.31 m and 0.07%, respectively, under TemAvg, while maintaining retrieval latency below 1 ms. These results surpass state-of-the-art methods, demonstrating strong prediction accuracy, robustness, and deployment feasibility.
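To make the retrieval mechanism concrete, below is a minimal Python sketch of a k-means-clustered HNSW index over precomputed scene embeddings. The library choices (scikit-learn, hnswlib) and every parameter value are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a k-means + HNSW retrieval stack (illustrative; not the paper's code).
import numpy as np
import hnswlib
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for scene embeddings produced by the video encoder.
embeddings = rng.standard_normal((10_000, 256)).astype(np.float32)

# Stage 1: k-means partitions the exemplar bank into coarse clusters.
kmeans = KMeans(n_clusters=64, n_init="auto", random_state=0).fit(embeddings)

# Stage 2: one HNSW index per cluster keeps candidate sets small and latency low.
indexes = {}
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    index = hnswlib.Index(space="cosine", dim=256)
    index.init_index(max_elements=len(members), ef_construction=200, M=16)
    index.add_items(embeddings[members], members)  # keep global ids as labels
    index.set_ef(50)
    indexes[c] = index

def retrieve(query: np.ndarray, k: int = 2) -> np.ndarray:
    """Route the query to its nearest cluster, then search that cluster's index."""
    cluster = int(kmeans.predict(query[None, :])[0])
    k = min(k, indexes[cluster].get_current_count())  # guard tiny clusters
    labels, _ = indexes[cluster].knn_query(query, k=k)
    return labels[0]  # global ids of the Top-k scene-aligned exemplars

# Top-2 exemplars gave the best accuracy-safety trade-off in the paper's ablations.
exemplar_ids = retrieve(embeddings[0], k=2)
```

Routing each query to a single cluster before the approximate nearest-neighbor search is one plausible way to reach the sub-millisecond latency the paper reports.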
Abstract
Accurate short-horizon trajectory prediction is pivotal for safe and reliable autonomous driving, yet existing vision-language models (VLMs) often fail to effectively ground their reasoning in scene dynamics and domain knowledge. To address this challenge, this paper introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT couples a temporal frequency-spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a scalable k-means + HNSW retrieval stack that supplies scene-aligned exemplars. Retrieved priors are embedded into chain-of-thought (CoT) prompts with explicit planning constraints, while a three-stage fine-tuning schedule incrementally aligns the language head to metric spatial cues, physically feasible motion, and temporally conditioned front-view planning. Evaluated on the nuScenes dataset, KEPT achieves state-of-the-art performance across open-loop protocols: under NoAvg, it reaches 0.70 m average L2 with a 0.21% collision rate; under TemAvg with lightweight ego status, it attains 0.31 m average L2 and a 0.07% collision rate. Ablation studies show that all three fine-tuning stages contribute complementary benefits, and that using the Top-2 retrieved exemplars yields the best accuracy-safety trade-off. The k-means-clustered HNSW index delivers sub-millisecond retrieval latency, supporting practical deployment. These results indicate that retrieval-augmented, CoT-guided VLMs offer a promising, data-efficient pathway toward interpretable and trustworthy autonomous driving.
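As an illustration of how retrieved priors might be folded into a CoT prompt alongside explicit planning constraints, the sketch below assembles a prompt string. The template wording, field names, and waypoint count are hypothetical assumptions, not the paper's exact prompt.

```python
# Hypothetical CoT prompt assembly from retrieved exemplars (not the paper's template).
def build_cot_prompt(exemplars: list[dict], ego_status: dict) -> str:
    lines = ["You are planning the ego vehicle's short-horizon trajectory."]
    # Fold the Top-k retrieved exemplars in as scene-aligned priors.
    for i, ex in enumerate(exemplars, 1):
        lines.append(f"Exemplar {i}: scene={ex['description']}, trajectory={ex['waypoints']}")
    lines += [
        f"Ego status: speed={ego_status['speed_mps']} m/s, "
        f"yaw_rate={ego_status['yaw_rate']} rad/s.",
        "Constraints: keep acceleration and curvature within feasible kinematic bounds;",
        "avoid collisions with observed agents.",
        "Reason step by step about the scene dynamics, then output future waypoints (x, y).",
    ]
    return "\n".join(lines)

prompt = build_cot_prompt(
    exemplars=[{"description": "straight road, light traffic",
                "waypoints": [(0.0, 1.2), (0.0, 2.5)]}],
    ego_status={"speed_mps": 5.4, "yaw_rate": 0.01},
)
```

Surfacing the kinematic constraints and exemplar trajectories as explicit text is what makes the resulting reasoning chain inspectable, in line with the interpretability claim above.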