Pedestrian Intention Prediction via Vision-Language Foundation Models

📅 2025-07-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Pedestrian crossing intention prediction is a critical safety challenge for autonomous driving; however, conventional vision-only approaches suffer from limited generalization, weak contextual modeling, and poor causal reasoning. This paper is the first to apply vision-language foundation models (VLFMs) to this task, proposing a time-aware hierarchical prompt engineering framework. The method integrates visual frames, physical cues (e.g., pose, velocity), and ego-vehicle dynamics into multimodal contextual representations, and employs an automated prompt optimization mechanism to improve prompt efficacy. Evaluated on the JAAD, PIE, and FU-PIP benchmarks, the approach achieves up to a 19.8% absolute accuracy improvement over prior methods, and automatic prompt optimization adds a further 12.5%. These results surpass existing vision-dominated approaches, demonstrating stronger generalization and causal reasoning, key attributes for robust real-world deployment.
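To make the hierarchical prompt idea concrete, the sketch below assembles scene, physical-cue, and ego-dynamics layers into a single time-aware prompt. It is a minimal illustration; the field names (frame_caption, pose, speed_kmh, ego_speed_kmh) and the template wording are assumptions, not the paper's actual template.

```python
# A minimal sketch of a time-aware hierarchical prompt. Field names and
# template wording are hypothetical; the paper's exact template is not public.
from dataclasses import dataclass
from typing import List

@dataclass
class FrameContext:
    t: float              # seconds before the prediction moment
    frame_caption: str    # scene description for the visual frame
    pose: str             # coarse pedestrian pose, e.g. "standing, facing road"
    speed_kmh: float      # estimated pedestrian speed
    ego_speed_kmh: float  # ego-vehicle speed at this frame

def build_prompt(frames: List[FrameContext]) -> str:
    """Assemble scene, physical-cue, and ego-dynamics layers into one prompt."""
    lines = ["You are assisting an autonomous vehicle.",
             "Observed sequence (oldest first):"]
    for f in frames:
        lines.append(
            f"- t-{f.t:.1f}s: scene: {f.frame_caption}; "
            f"pedestrian pose: {f.pose}, speed {f.speed_kmh:.1f} km/h; "
            f"ego-vehicle speed {f.ego_speed_kmh:.1f} km/h"
        )
    lines.append("Question: Will the pedestrian cross the road in front of the "
                 "vehicle? Answer 'crossing' or 'not crossing' and explain briefly.")
    return "\n".join(lines)
```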

📝 Abstract
Prediction of pedestrian crossing intention is a critical function in autonomous vehicles. Conventional vision-based methods of crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, physical cue observations, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts significantly enhances prediction accuracy, by up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded a further 12.5% accuracy gain. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.
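The abstract attributes much of the gain to encoding vehicle speed and its variation over time in time-conscious language. The sketch below shows one way such a cue could be phrased; the thresholds, sampling interval, and wording are assumptions, not the paper's formulation.

```python
# A sketch of turning an ego-vehicle speed series into a time-conscious
# textual cue. Thresholds (±2 km/h) and phrasing are assumptions.
from typing import Sequence

def describe_speed_trend(speeds_kmh: Sequence[float], dt: float = 0.5) -> str:
    """Summarise a speed series as an explicit, time-stamped trend statement."""
    if not speeds_kmh:
        return "Ego-vehicle speed is unknown."
    if len(speeds_kmh) == 1:
        return f"Ego-vehicle speed is {speeds_kmh[0]:.1f} km/h."
    delta = speeds_kmh[-1] - speeds_kmh[0]
    span = dt * (len(speeds_kmh) - 1)
    if delta < -2.0:
        trend = "decelerating"        # e.g. yielding to the pedestrian
    elif delta > 2.0:
        trend = "accelerating"
    else:
        trend = "holding a steady speed"
    return (f"Over the last {span:.1f}s the ego-vehicle went from "
            f"{speeds_kmh[0]:.1f} to {speeds_kmh[-1]:.1f} km/h ({trend}).")

# Example: a braking vehicle often correlates with an imminent crossing.
print(describe_speed_trend([42.0, 36.5, 30.1, 24.8]))
```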
Problem

Research questions and friction points this paper is trying to address.

Predict pedestrian crossing intention for autonomous vehicles
Improve generalizability and context understanding in prediction
Enhance accuracy using vision-language models and optimized prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision-language models for pedestrian intention prediction
Integrates multimodal data via hierarchical prompt templates
Enhances accuracy with an automatic prompt engineering framework (see the sketch below)
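The sketch below illustrates the general shape of such a framework: score candidate prompt templates on labelled clips and keep the best one. The query_vlfm callable and the greedy selection strategy are assumptions; the paper's actual optimiser is not detailed in this summary.

```python
# A minimal sketch of an automatic prompt-engineering loop. `query_vlfm`,
# the candidate templates, and greedy selection are hypothetical stand-ins.
from typing import Callable, List, Tuple

def evaluate(template: str,
             clips: List[Tuple[str, str]],
             query_vlfm: Callable[[str], str]) -> float:
    """Fraction of clips where the model's answer equals the ground-truth label."""
    hits = 0
    for context, label in clips:  # label is "crossing" or "not crossing"
        answer = query_vlfm(template.format(context=context))
        hits += int(answer.strip().lower() == label)
    return hits / len(clips)

def optimise_prompt(candidates: List[str],
                    clips: List[Tuple[str, str]],
                    query_vlfm: Callable[[str], str]) -> str:
    """Greedy selection: return the candidate template with the best accuracy."""
    return max(candidates, key=lambda t: evaluate(t, clips, query_vlfm))
```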
Mohsen Azarmi
Institute for Transport Studies, Faculty of Environment, Computer Vision and Machine Learning Group, University of Leeds, LS2 9JT, United Kingdom
Mahdi Rezaei
Associate Professor, University of Leeds
AI, Computer Vision, Machine Learning, Autonomous Vehicles, Large Language Models
He Wang
AI Centre, Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom