AI Summary
To address the limited accuracy and weak interpretability of short-horizon trajectory prediction in autonomous driving, this paper proposes KEPT, a knowledge-enhanced trajectory prediction framework. Methodologically, KEPT integrates vision-language models with domain-specific driving knowledge to construct a spatio-temporal-frequency joint video encoder; introduces retrieval-augmented (k-means + HNSW) and chain-of-thought prompting mechanisms that incorporate interpretable kinematic and geometric planning constraints; and employs a three-stage fine-tuning strategy to progressively align spatial, motion, and temporal dynamics. On the nuScenes dataset, KEPT achieves an average L2 error of 0.70 m and a collision rate of 0.21% under the NoAvg protocol, improving to 0.31 m and 0.07%, respectively, under TemAvg, while maintaining retrieval latency below 1 ms. These results surpass state-of-the-art methods, demonstrating strong prediction accuracy, robustness, and deployment feasibility.
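To make the retrieval mechanism concrete, below is a minimal Python sketch of a k-means-clustered HNSW index over precomputed scene embeddings. The library choices (scikit-learn, hnswlib) and every parameter value are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a k-means + HNSW retrieval stack (illustrative; not the paper's code).
import numpy as np
import hnswlib
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for scene embeddings produced by the video encoder.
embeddings = rng.standard_normal((10_000, 256)).astype(np.float32)

# Stage 1: k-means partitions the exemplar bank into coarse clusters.
kmeans = KMeans(n_clusters=64, n_init="auto", random_state=0).fit(embeddings)

# Stage 2: one HNSW index per cluster keeps candidate sets small and latency low.
indexes = {}
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    index = hnswlib.Index(space="cosine", dim=256)
    index.init_index(max_elements=len(members), ef_construction=200, M=16)
    index.add_items(embeddings[members], members)  # keep global ids as labels
    index.set_ef(50)
    indexes[c] = index

def retrieve(query: np.ndarray, k: int = 2) -> np.ndarray:
    """Route the query to its nearest cluster, then search that cluster's index."""
    cluster = int(kmeans.predict(query[None, :])[0])
    k = min(k, indexes[cluster].get_current_count())  # guard tiny clusters
    labels, _ = indexes[cluster].knn_query(query, k=k)
    return labels[0]  # global ids of the Top-k scene-aligned exemplars

# Top-2 exemplars gave the best accuracy-safety trade-off in the paper's ablations.
exemplar_ids = retrieve(embeddings[0], k=2)
```

Routing each query to a single cluster before the approximate nearest-neighbor search is one plausible way to reach the sub-millisecond latency the paper reports.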
Abstract
Accurate short-horizon trajectory prediction is pivotal for safe and reliable autonomous driving, yet existing vision-language models (VLMs) often fail to effectively ground their reasoning in scene dynamics and domain knowledge. To address this challenge, this paper introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT couples a temporal frequency-spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a scalable k-means + HNSW retrieval stack that supplies scene-aligned exemplars. Retrieved priors are embedded into chain-of-thought (CoT) prompts with explicit planning constraints, while a three-stage fine-tuning schedule incrementally aligns the language head to metric spatial cues, physically feasible motion, and temporally conditioned front-view planning. Evaluated on the nuScenes dataset, KEPT achieves state-of-the-art performance across open-loop protocols: under NoAvg, it reaches 0.70 m average L2 with a 0.21% collision rate; under TemAvg with lightweight ego status, it attains 0.31 m average L2 and a 0.07% collision rate. Ablation studies show that all three fine-tuning stages contribute complementary benefits, and that using the Top-2 retrieved exemplars yields the best accuracy-safety trade-off. The k-means-clustered HNSW index delivers sub-millisecond retrieval latency, supporting practical deployment. These results indicate that retrieval-augmented, CoT-guided VLMs offer a promising, data-efficient pathway toward interpretable and trustworthy autonomous driving.
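As an illustration of how retrieved priors might be folded into a CoT prompt alongside explicit planning constraints, the sketch below assembles a prompt string. The template wording, field names, and waypoint count are hypothetical assumptions, not the paper's exact prompt.

```python
# Hypothetical CoT prompt assembly from retrieved exemplars (not the paper's template).
def build_cot_prompt(exemplars: list[dict], ego_status: dict) -> str:
    lines = ["You are planning the ego vehicle's short-horizon trajectory."]
    # Fold the Top-k retrieved exemplars in as scene-aligned priors.
    for i, ex in enumerate(exemplars, 1):
        lines.append(f"Exemplar {i}: scene={ex['description']}, trajectory={ex['waypoints']}")
    lines += [
        f"Ego status: speed={ego_status['speed_mps']} m/s, "
        f"yaw_rate={ego_status['yaw_rate']} rad/s.",
        "Constraints: keep acceleration and curvature within feasible kinematic bounds;",
        "avoid collisions with observed agents.",
        "Reason step by step about the scene dynamics, then output future waypoints (x, y).",
    ]
    return "\n".join(lines)

prompt = build_cot_prompt(
    exemplars=[{"description": "straight road, light traffic",
                "waypoints": [(0.0, 1.2), (0.0, 2.5)]}],
    ego_status={"speed_mps": 5.4, "yaw_rate": 0.01},
)
```

Surfacing the kinematic constraints and exemplar trajectories as explicit text is what makes the resulting reasoning chain inspectable, in line with the interpretability claim above.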