AI Summary
Pedestrian crossing intention prediction is a critical safety task for autonomous driving, yet conventional vision-only approaches struggle with generalization, contextual modeling, and causal reasoning. This paper introduces vision-language foundation models (VLFMs) to this task for the first time, proposing a time-aware hierarchical prompt engineering framework. The method integrates visual frames, physical cues (e.g., pose, velocity), and ego-vehicle dynamics into multimodal contextual prompts and employs an automated prompt optimization mechanism to improve prompt efficacy. Evaluated on the JAAD, PIE, and FU-PIP benchmarks, the approach improves prediction accuracy by up to 19.8% when vehicle speed, its variation over time, and time-conscious prompts are incorporated, and automatic prompt optimization yields a further 12.5% gain. These results surpass existing vision-dominated approaches, demonstrating stronger generalization, contextual understanding, and causal reasoning, key attributes for robust real-world deployment.
Abstract
Prediction of pedestrian crossing intention is a critical function for autonomous vehicles. Conventional vision-based methods for crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, observations of physical cues, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variation over time, and time-conscious prompts significantly enhances prediction accuracy, by up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded a further 12.5% accuracy gain. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.
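To make the prompt-construction idea more concrete, below is a minimal illustrative sketch of how a time-aware, hierarchical prompt could assemble scene descriptions, pedestrian cues, and ego-vehicle speed history before being passed, together with the frames, to a VLFM. This is not the authors' implementation; the `ObservationWindow` container and the `build_crossing_prompt` helper are hypothetical names, and the field structure is an assumption based on the context described in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ObservationWindow:
    """Hypothetical container for the multimodal context described in the abstract."""
    frame_descriptions: List[str]   # per-frame scene context (e.g., captions of the observed frames)
    pedestrian_cues: List[str]      # physical cue observations per frame (e.g., pose, motion state)
    ego_speeds_kmh: List[float]     # ego-vehicle speed at each observed timestep


def build_crossing_prompt(obs: ObservationWindow) -> str:
    """Assemble a time-aware, hierarchical prompt: scene level, pedestrian level, ego-vehicle level."""
    lines = ["You observe a pedestrian over consecutive timesteps from a moving vehicle."]
    for t, (scene, cue, speed) in enumerate(
        zip(obs.frame_descriptions, obs.pedestrian_cues, obs.ego_speeds_kmh)
    ):
        lines.append(f"t={t}: scene: {scene} | pedestrian: {cue} | ego speed: {speed:.1f} km/h")
    # Summarise the speed variation over the window so the model can reason about
    # deceleration or acceleration of the ego vehicle.
    delta = obs.ego_speeds_kmh[-1] - obs.ego_speeds_kmh[0]
    lines.append(f"Ego-vehicle speed change over the window: {delta:+.1f} km/h.")
    lines.append(
        "Question: Will the pedestrian cross the road in the next moment? "
        "Answer 'crossing' or 'not crossing' and explain briefly."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    obs = ObservationWindow(
        frame_descriptions=["pedestrian near the curb at a marked crosswalk"] * 3,
        pedestrian_cues=["standing, facing the road", "leaning forward", "first step off the curb"],
        ego_speeds_kmh=[32.0, 27.0, 21.0],
    )
    print(build_crossing_prompt(obs))  # text that would accompany the frames sent to a VLFM
```

In a full pipeline, the resulting prompt would be sent to the VLFM alongside the corresponding image frames, and its wording would then be refined through the automatic prompt engineering step referred to above.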