🤖 AI Summary
This paper addresses the challenging problem of pedestrian crossing-intention prediction in autonomous driving. It proposes an attention-guided cross-modal interaction Transformer (ACIT), a multimodal fusion framework that integrates six visual and motion modalities (global semantic maps, global optical flow, local RGB frames, local optical flow, ego-vehicle speed, and pedestrian bounding boxes) via a dual-path attention mechanism and cross-modal attention. A Transformer-based temporal aggregation module captures how pedestrian behavior evolves over time. Key contributions include (i) the synergistic design of optical-flow-guided attention and cross-modal attention, and (ii) multi-granularity temporal modeling. The method achieves 70% and 89% accuracy on the JAADbeh and JAADall benchmarks, respectively, outperforming state-of-the-art approaches; ablation studies confirm the effectiveness and complementarity of each component.
📝 Abstract
Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from heterogeneous data sources remains a major challenge. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, grouped into three interaction pairs: (1) global semantic map and global optical flow, (2) local RGB image and local optical flow, and (3) ego-vehicle speed and pedestrian bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions of the primary modality through intra-modal self-attention and enables deep interaction with the auxiliary modality (i.e., optical flow) via optical-flow-guided attention. Within the motion interaction pair, cross-modal attention models the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multimodal feature fusion module further facilitates cross-modal interaction at each time step, and a Transformer-based temporal feature aggregation module captures sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies further investigate the contribution of each module of ACIT.
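The abstract does not spell out the attention computation, but the cross-modal attention it describes is presumably a variant of standard scaled dot-product attention in which queries come from the primary modality and keys/values from the auxiliary one (e.g., optical flow). The following is a minimal NumPy sketch of that general mechanism, not the paper's actual implementation; all names, shapes, and feature dimensions are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(primary, auxiliary):
    """Queries from the primary modality attend over keys/values
    taken from the auxiliary modality (illustrative sketch; the
    paper's module likely adds learned projections and heads)."""
    d_k = primary.shape[-1]
    scores = primary @ auxiliary.T / np.sqrt(d_k)  # (T, T) affinities
    weights = softmax(scores, axis=-1)             # rows sum to 1
    return weights @ auxiliary                     # (T, d) fused features

# Hypothetical per-time-step features: 8 steps, 16-d embeddings.
rng = np.random.default_rng(0)
rgb_feats = rng.standard_normal((8, 16))   # primary: local RGB stream
flow_feats = rng.standard_normal((8, 16))  # auxiliary: optical flow
fused = cross_modal_attention(rgb_feats, flow_feats)
print(fused.shape)  # (8, 16)
```

In ACIT this kind of interaction is applied within each modality pair, and the per-time-step fused features are then passed to the Transformer-based temporal aggregation module.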