AI Summary
Pedestrian crossing intention prediction is a critical safety task for autonomous driving, yet conventional vision-only approaches struggle with generalization, contextual modeling, and causal reasoning. This paper introduces vision-language foundation models (VLFMs) to this task for the first time, proposing a time-aware hierarchical prompt engineering framework. The method integrates visual frames, physical cues (e.g., pose, velocity), and ego-vehicle dynamics into multimodal contextual prompts and employs an automated prompt optimization mechanism to improve prompt efficacy. Evaluated on the JAAD, PIE, and FU-PIP benchmarks, the approach improves prediction accuracy by up to 19.8% when vehicle speed, its variation over time, and time-conscious prompts are incorporated, and automatic prompt optimization yields a further 12.5% gain. These results surpass existing vision-dominated approaches, demonstrating stronger generalization, contextual understanding, and causal reasoning, key attributes for robust real-world deployment.
Abstract
Prediction of pedestrian crossing intention is a critical function for autonomous vehicles. Conventional vision-based methods for crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, observations of physical cues, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variation over time, and time-conscious prompts significantly enhances prediction accuracy, by up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded a further 12.5% accuracy gain. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.
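To make the prompt-construction idea more concrete, below is a minimal illustrative sketch of how a time-aware, hierarchical prompt could assemble scene descriptions, pedestrian cues, and ego-vehicle speed history before being passed, together with the frames, to a VLFM. This is not the authors' implementation; the `ObservationWindow` container and the `build_crossing_prompt` helper are hypothetical names, and the field structure is an assumption based on the context described in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ObservationWindow:
    """Hypothetical container for the multimodal context described in the abstract."""
    frame_descriptions: List[str]   # per-frame scene context (e.g., captions of the observed frames)
    pedestrian_cues: List[str]      # physical cue observations per frame (e.g., pose, motion state)
    ego_speeds_kmh: List[float]     # ego-vehicle speed at each observed timestep


def build_crossing_prompt(obs: ObservationWindow) -> str:
    """Assemble a time-aware, hierarchical prompt: scene level, pedestrian level, ego-vehicle level."""
    lines = ["You observe a pedestrian over consecutive timesteps from a moving vehicle."]
    for t, (scene, cue, speed) in enumerate(
        zip(obs.frame_descriptions, obs.pedestrian_cues, obs.ego_speeds_kmh)
    ):
        lines.append(f"t={t}: scene: {scene} | pedestrian: {cue} | ego speed: {speed:.1f} km/h")
    # Summarise the speed variation over the window so the model can reason about
    # deceleration or acceleration of the ego vehicle.
    delta = obs.ego_speeds_kmh[-1] - obs.ego_speeds_kmh[0]
    lines.append(f"Ego-vehicle speed change over the window: {delta:+.1f} km/h.")
    lines.append(
        "Question: Will the pedestrian cross the road in the next moment? "
        "Answer 'crossing' or 'not crossing' and explain briefly."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    obs = ObservationWindow(
        frame_descriptions=["pedestrian near the curb at a marked crosswalk"] * 3,
        pedestrian_cues=["standing, facing the road", "leaning forward", "first step off the curb"],
        ego_speeds_kmh=[32.0, 27.0, 21.0],
    )
    print(build_crossing_prompt(obs))  # text that would accompany the frames sent to a VLFM
```

In a full pipeline, the resulting prompt would be sent to the VLFM alongside the corresponding image frames, and its wording would then be refined through the automatic prompt engineering step referred to above.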