PIP-Net: Pedestrian Intention Prediction in the Wild

📅 2024-02-20
🏛️ IEEE Transactions on Intelligent Transportation Systems
📈 Citations: 8
Influential: 1
🤖 AI Summary
Pedestrian intention prediction (PIP) in real-world urban road scenarios remains a critical challenge for autonomous driving perception systems. Method: The paper proposes PIP-Net, a framework offered in two variants that jointly models kinematic dynamics and spatial scene context, supporting both single- and multi-camera inputs. It introduces a cross-modal fusion mechanism that integrates a categorical depth feature map with local motion flow features, and employs a recurrent architecture enhanced with temporal attention. Contribution/Results: The authors also present Urban-PIP, the first pedestrian intention prediction dataset with multi-camera annotations collected in real-world automated driving scenarios. Extensive experiments demonstrate that the method significantly outperforms state-of-the-art approaches, predicting crossing intention up to 4 seconds in advance with substantial gains in classification accuracy. The proposed framework offers a robust, scalable approach to pedestrian intention prediction in complex urban environments.

📝 Abstract
Accurate pedestrian intention prediction (PIP) by Autonomous Vehicles (AVs) is one of the current research challenges in this field. In this article, we introduce PIP-Net, a novel framework designed to predict pedestrian crossing intentions for AVs in real-world urban scenarios. We offer two variants of PIP-Net designed for different camera mounts and setups. Leveraging both kinematic data and spatial features from the driving scene, the proposed model employs a recurrent and temporal attention-based solution, outperforming the state of the art. To enhance the visual representation of road users and their proximity to the ego vehicle, we introduce a categorical depth feature map, combined with a local motion flow feature, providing rich insights into the scene dynamics. Additionally, we explore the impact of expanding the camera’s field of view, from one to three cameras surrounding the ego vehicle, leading to an enhancement in the model’s contextual perception. Depending on the traffic scenario and road environment, the model excels in predicting pedestrian crossing intentions up to 4 seconds in advance, which is a breakthrough in current research studies in pedestrian intention prediction. Finally, for the first time, we present the Urban-PIP dataset, a customised pedestrian intention prediction dataset, with multi-camera annotations in real-world automated driving scenarios.
Problem

Research questions and friction points this paper is trying to address.

Predict pedestrian crossing intentions for autonomous vehicles
Enhance prediction using kinematic data and spatial features
Improve contextual perception with multi-camera setups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent and temporal attention-based solution
Categorical depth and motion flow features
Multi-camera setup for contextual perception
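The recurrent, temporal-attention design named above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, the Elman-style recurrence, and the single-vector attention scorer are all assumptions standing in for PIP-Net's actual architecture; the per-frame inputs stand in for the fused depth, motion flow, and appearance features.

```python
import numpy as np

# Hypothetical dimensions: 16 observed frames, 32-dim fused per-frame features.
rng = np.random.default_rng(0)
T, D = 16, 32
frames = rng.normal(size=(T, D))  # stand-in for fused depth/motion/appearance features

# Simple recurrent pass (Elman-style) producing hidden states h_t.
W_h = rng.normal(scale=0.1, size=(D, D))
W_x = rng.normal(scale=0.1, size=(D, D))
h = np.zeros(D)
hidden = []
for x in frames:
    h = np.tanh(W_h @ h + W_x @ x)
    hidden.append(h)
H = np.stack(hidden)  # (T, D)

# Temporal attention: score each frame, softmax over time, weighted sum.
w_att = rng.normal(scale=0.1, size=D)
scores = H @ w_att
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()          # attention weights over the T frames
context = alpha @ H           # (D,) attention-pooled sequence representation

# Binary crossing-intention head (logistic).
w_out = rng.normal(scale=0.1, size=D)
p_cross = 1.0 / (1.0 + np.exp(-(context @ w_out)))
```

The attention weights let the classifier emphasise the frames most indicative of a crossing decision (e.g. a pedestrian turning toward the kerb) rather than treating all observed frames equally.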
Mohsen Azarmi
Institute for Transport Studies, University of Leeds, United Kingdom
Mahdi Rezaei
Associate Professor, University of Leeds
AI, Computer Vision, Machine Learning, Autonomous Vehicles, Large Language Models
He Wang
Department of Computer Science, University College London, United Kingdom