ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenging problem of pedestrian crossing-intention prediction in autonomous driving. The authors propose a multimodal fusion framework that integrates six modalities—global semantic maps, local RGB frames, optical flow, ego-vehicle speed, pedestrian bounding boxes, and trajectories—via a dual-path attention mechanism and a cross-modal dynamic modeling module. A temporal Transformer aggregation module captures behavioral evolution over time. The key contributions are: (i) a synergistic design of optical-flow-guided attention and cross-modal attention, and (ii) multi-granularity temporal modeling. The method achieves 70% and 89% accuracy on the JAADbeh and JAADall benchmarks, respectively, outperforming state-of-the-art approaches. Ablation studies confirm the effectiveness and complementary nature of each component.

📝 Abstract
Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains one of the major challenges. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, which are grouped into three interaction pairs: (1) Global semantic map and global optical flow, (2) Local RGB image and local optical flow, and (3) Ego-vehicle speed and pedestrian's bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions within the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical flow-guided attention. Within the motion interaction pair, cross-modal attention is employed to model the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step. Furthermore, a Transformer-based temporal feature aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies are further conducted to investigate the contribution of different modules of ACIT.
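The cross-modal attention described for the motion pair (ego-vehicle speed and pedestrian bounding box) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the learned query/key/value projections are omitted for brevity, and the feature shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, ctx_feats):
    """Queries come from one modality (e.g. ego-speed embeddings),
    keys/values from the other (e.g. bounding-box embeddings).
    Learned projections are omitted in this sketch."""
    d_k = query_feats.shape[-1]
    scores = query_feats @ ctx_feats.T / np.sqrt(d_k)  # (T_q, T_c)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return weights @ ctx_feats                         # (T_q, d)
```

Each output row is a convex combination of the context modality's features, which is how complementary motion cues from the auxiliary stream are pulled into the primary stream.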
Problem

Research questions and friction points this paper is trying to address.

Predicting pedestrian crossing intention for autonomous vehicles
Integrating complementary cues from multiple data modalities
Modeling cross-modal dynamics and temporal dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-guided cross-modal interaction Transformer for intention prediction
Dual-path attention mechanism enhances salient regions in primary modality
Transformer-based temporal aggregation captures sequential dependencies
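The dual-path idea for the visual pairs (self-attention on the primary modality plus optical-flow-guided attention over it) can be sketched as follows. This is a hedged simplification, not the paper's architecture: projections, normalization, and multi-head splitting are dropped, and the two paths are assumed to have equal token counts so their outputs can be summed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_path_attention(visual, flow):
    """visual: primary-modality tokens (e.g. local RGB features).
    flow: auxiliary optical-flow tokens (assumed same count/dim)."""
    d = visual.shape[-1]
    # Path 1: intra-modal self-attention over the visual tokens.
    self_path = softmax(visual @ visual.T / np.sqrt(d)) @ visual
    # Path 2: flow tokens act as queries that re-weight the same
    # visual tokens (optical-flow-guided attention).
    flow_path = softmax(flow @ visual.T / np.sqrt(d)) @ visual
    return self_path + flow_path
```

In the full model these per-frame fused features would then feed the multi-modal fusion module and the Transformer-based temporal aggregator.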
Yuanzhe Li
Chair of Automotive Engineering, Technische Universität Berlin, Berlin, 13355, Germany
Steffen Müller
Professor of Automotive Engineering, TU Berlin