🤖 AI Summary
To address safety risks in autonomous driving caused by inaccurate pedestrian crossing-intention prediction in urban environments, this paper proposes a progressive multi-context fusion framework. Methodologically, it pioneers the joint integration of four contextual dimensions—pedestrian behavior, environmental state, pedestrian localization, and vehicle motion—and introduces mutual intra-context and cross-context attention mechanisms, followed by guided intra-context and cross-context attention. A global CLS token is further incorporated to enable deep interaction and efficient aggregation of heterogeneous information. Built upon the Transformer architecture, the model balances modeling capacity with computational efficiency. In extensive experiments on three benchmark datasets—JAADbeh, JAADall, and PIE—the model achieves accuracy of 73%, 93%, and 90%, respectively, surpassing existing state-of-the-art methods. These results empirically validate that synergistic multi-context modeling significantly enhances both robustness and generalizability in pedestrian intention prediction.
📝 Abstract
Pedestrian crossing intention prediction is essential for autonomous vehicles to improve pedestrian safety and reduce traffic accidents. However, accurate pedestrian intention prediction in urban environments remains challenging due to the multitude of factors affecting pedestrian behavior. In this paper, we propose a multi-context fusion Transformer (MFT) that leverages diverse numerical contextual attributes across four key dimensions—pedestrian behavior context, environmental context, pedestrian localization context, and vehicle motion context—to enable accurate pedestrian intention prediction. MFT employs a progressive fusion strategy, where mutual intra-context attention enables reciprocal interactions within each context, thereby facilitating feature sequence fusion and yielding a context token as a context-specific representation. This is followed by mutual cross-context attention, which integrates features across contexts with a global CLS token serving as a compact multi-context representation. Finally, guided intra-context attention refines context tokens within each context through directed interactions, while guided cross-context attention strengthens the global CLS token to promote multi-context fusion via guided information propagation, yielding deeper and more efficient integration. Experimental results validate the superiority of MFT over state-of-the-art methods, achieving accuracy rates of 73%, 93%, and 90% on the JAADbeh, JAADall, and PIE datasets, respectively. Extensive ablation studies are further conducted to investigate the effectiveness of the network architecture and the contributions of the different input contexts. Our code is open-source: https://github.com/ZhongHang0307/Multi-Context-Fusion-Transformer.
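The two-stage fusion described above—intra-context attention producing one context token per dimension, then cross-context attention aggregating those tokens into a global CLS token—can be sketched numerically. This is a minimal illustration of the general token-fusion pattern, not the authors' implementation: the sequence lengths, feature dimension, and random vectors standing in for learned tokens and projections are all assumptions, and the learned Q/K/V weights, multi-head structure, and guided-attention refinement stages of MFT are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # feature dimension (illustrative choice)

def self_attention(x):
    """Scaled dot-product self-attention, single head, no learned
    projections (a real Transformer layer would use learned Q/K/V)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# The four context dimensions from the paper, each a sequence of
# per-frame feature vectors (random stand-ins for real features).
contexts = {
    "behavior":       rng.normal(size=(8, d)),
    "environment":    rng.normal(size=(5, d)),
    "localization":   rng.normal(size=(8, d)),
    "vehicle_motion": rng.normal(size=(8, d)),
}

# Stage 1: intra-context attention. Prepend a context token to each
# sequence; after self-attention its row summarizes that context.
context_tokens = []
for name, seq in contexts.items():
    ctx_tok = rng.normal(size=(1, d))      # stands in for a learned token
    fused = self_attention(np.vstack([ctx_tok, seq]))
    context_tokens.append(fused[0])        # context-specific representation

# Stage 2: cross-context attention over the four context tokens plus a
# global CLS token, which aggregates information across contexts.
cls = rng.normal(size=(1, d))
stack = np.vstack([cls] + [t[None, :] for t in context_tokens])
fused = self_attention(stack)
multi_context = fused[0]                   # compact multi-context representation

# A classification head on this CLS output would produce the
# crossing / not-crossing prediction.
print(multi_context.shape)  # → (16,)
```

The key design point the sketch captures is that each context is first compressed into a single token before any cross-context mixing, so the second attention stage operates over only five tokens regardless of the per-context sequence lengths.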