KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models

📅 2025-09-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address insufficient short-horizon trajectory prediction accuracy and poor interpretability in autonomous driving, this paper proposes KEPT, a knowledge-enhanced trajectory prediction framework. Methodologically, KEPT integrates vision-language models with domain-specific driving knowledge to build a spatio-temporal-frequency joint video encoder; introduces retrieval-augmented (k-means + HNSW) and chain-of-thought prompting mechanisms that incorporate interpretable kinematic and geometric planning constraints; and employs a three-stage fine-tuning strategy to progressively align spatial, motion, and temporal dynamics. On the nuScenes dataset, KEPT achieves an average L2 error of 0.70 m and a collision rate of 0.21% under the NoAvg protocol, improving to 0.31 m and 0.07%, respectively, under TemAvg, while keeping retrieval latency below 1 ms. These results surpass state-of-the-art methods, demonstrating strong prediction accuracy, robustness, and deployment feasibility.
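The NoAvg and TemAvg figures above are open-loop L2 metrics. The sketch below computes them under one common reading of the two protocols (NoAvg: the error at each horizon waypoint itself; TemAvg: the error averaged over all waypoints up to that horizon), assuming nuScenes' 2 Hz, 3 s planning setting. All function names are illustrative, not the paper's code, and the paper's exact protocol may differ:

```python
import numpy as np

def l2_errors(pred, gt):
    """Per-timestep Euclidean (L2) distance between predicted and
    ground-truth ego waypoints; both arrays have shape (T, 2)."""
    return np.linalg.norm(pred - gt, axis=-1)

def open_loop_l2(pred, gt, steps_per_second=2):
    """Toy reading of the two nuScenes open-loop protocols:
    NoAvg reports the error at the 1s/2s/3s waypoint itself;
    TemAvg averages per-timestep errors up to that horizon."""
    err = l2_errors(pred, gt)
    out = {}
    for h in (1, 2, 3):
        idx = h * steps_per_second - 1          # index of the horizon waypoint
        out[f"NoAvg@{h}s"] = float(err[idx])
        out[f"TemAvg@{h}s"] = float(err[: idx + 1].mean())
    out["NoAvg_avg"] = float(np.mean([out[f"NoAvg@{h}s"] for h in (1, 2, 3)]))
    out["TemAvg_avg"] = float(np.mean([out[f"TemAvg@{h}s"] for h in (1, 2, 3)]))
    return out

# Toy trajectory: 3 s at 2 Hz -> 6 waypoints, constant 0.1 m lateral offset.
gt = np.stack([np.arange(1, 7, dtype=float), np.zeros(6)], axis=1)
pred = gt + np.array([0.0, 0.1])
metrics = open_loop_l2(pred, gt)
print(metrics["NoAvg_avg"], metrics["TemAvg_avg"])  # both ≈ 0.1 for this toy offset
```

On real trajectories, where error grows with horizon, temporal averaging pulls in the smaller early-timestep errors, which is partly why TemAvg numbers (0.31 m above) are lower than NoAvg ones (0.70 m).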

πŸ“ Abstract
Accurate short-horizon trajectory prediction is pivotal for safe and reliable autonomous driving, yet existing vision-language models (VLMs) often fail to effectively ground their reasoning in scene dynamics and domain knowledge. To address this challenge, this paper introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT couples a temporal frequency-spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a scalable k-means + HNSW retrieval stack that supplies scene-aligned exemplars. Retrieved priors are embedded into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning schedule incrementally aligns the language head to metric spatial cues, physically feasible motion, and temporally conditioned front-view planning. Evaluated on the nuScenes dataset, KEPT achieves state-of-the-art performance across open-loop protocols: under NoAvg, it achieves 0.70 m average L2 with a 0.21% collision rate; under TemAvg with lightweight ego status, it attains 0.31 m average L2 and a 0.07% collision rate. Ablation studies show that all three fine-tuning stages contribute complementary benefits, and that using the Top-2 retrieved exemplars yields the best accuracy-safety trade-off. The k-means-clustered HNSW index delivers sub-millisecond retrieval latency, supporting practical deployment. These results indicate that retrieval-augmented, CoT-guided VLMs offer a promising, data-efficient pathway toward interpretable and trustworthy autonomous driving.
Problem

Research questions and friction points this paper is trying to address.

Accurate short-horizon trajectory prediction for safe autonomous driving
Vision-language models failing to ground reasoning in scene dynamics
Lack of effective integration of domain knowledge in trajectory prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

TFSF video encoder with self-supervised learning
k-means + HNSW retrieval for scene-aligned exemplars
Triple-stage fine-tuning with explicit planning constraints
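The k-means + HNSW retrieval idea can be sketched in toy form: cluster the exemplar bank with k-means, route a query embedding to its nearest centroid, then search only inside that cluster. Exact nearest-neighbor search stands in for the HNSW graph here, and every size, dimension, and name is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy exemplar bank: 1000 scene embeddings of dimension 32 (stand-ins
# for the video encoder's outputs; all sizes are illustrative).
bank = rng.normal(size=(1000, 32)).astype(np.float32)

def kmeans(x, k, iters=20):
    """Minimal Lloyd's k-means; returns centroids and assignments."""
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(x[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = x[assign == j].mean(axis=0)
    return centroids, assign

centroids, assign = kmeans(bank, k=8)

def retrieve(query, top_k=2):
    """Route the query to its nearest cluster, then search only that
    cluster's members. In KEPT the intra-cluster search is an HNSW
    graph index; exact search stands in for it in this sketch."""
    c = np.linalg.norm(centroids - query, axis=1).argmin()
    members = np.flatnonzero(assign == c)
    d = np.linalg.norm(bank[members] - query, axis=1)
    return members[d.argsort()[:top_k]]          # exemplar indices

ids = retrieve(bank[42] + 1e-3)  # query almost identical to exemplar 42
print(ids)
```

In the real system the intra-cluster search would be an HNSW index (e.g. via a library such as hnswlib), which is what keeps retrieval sub-millisecond at scale; exact search is used here only to keep the sketch dependency-free beyond NumPy.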
Authors

Yujin Wang (Ph.D. Student, Tongji University)
Tianyi Wang (Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, Austin, Texas, 78712, USA)
Quanfeng Liu (School of Architecture, The University of Texas at Austin, Austin, Texas, 78712, USA)
Wenxian Fan (School of Automotive and Traffic Engineering, Wuhan University of Science and Technology, Wuhan, 430081, China)
Junfeng Jiao (Associate Professor, Urban Information Lab, Texas Smart City, NSF NRT AI, UT Austin)
Christian Claudel (UT Austin)
Yunbing Yan (School of Automotive Studies, Tongji University, Shanghai, 201804, China)
Bingzhao Gao (Professor, School of Automotive Studies, Tongji University)
Jianqiang Wang
Hong Chen (School of Automotive Studies, Tongji University, Shanghai, 201804, China)