DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction

📅 2025-06-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the challenges of motion risk prediction in autonomous driving's long-tail scenarios and weak model generalization due to the scarcity of real high-risk data, this paper proposes the DriveMRP-Agent framework and the DriveMRP-10K synthetic dataset, introducing the first bird's-eye-view (BEV) multi-agent joint risk modeling approach. The method leverages trajectory projection, global context injection, and fine-tuned vision-language models (VLMs) to construct a VLM-agnostic risk assessment architecture, enabling unified inference of ego-vehicle, traffic-agent, and environmental risks. On synthetic data, accident identification accuracy reaches 88.03% (+60.9 percentage points); zero-shot performance on a real high-risk benchmark achieves 68.50% (+39.1 percentage points), significantly improving cross-domain generalization. Key contributions include: (1) a novel paradigm for high-risk motion data synthesis; (2) a BEV-based multi-agent joint risk modeling mechanism; and (3) a decoupled VLM architecture for risk reasoning.

๐Ÿ“ Abstract
Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage. In this work, we aim to explore whether it is possible to enhance the motion risk prediction capabilities of Vision-Language Models (VLMs) by synthesizing high-risk motion data. Specifically, we introduce a Bird's-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for VLM training, which we call DriveMRP-10K. Furthermore, we design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent. This framework incorporates a novel information injection strategy for global context, ego-vehicle perspective, and trajectory projection, enabling VLMs to effectively reason about the spatial relationships between motion waypoints and the environment. Extensive experiments demonstrate that by fine-tuning with DriveMRP-10K, our DriveMRP-Agent framework can significantly improve the motion risk prediction performance of multiple VLM baselines, with accident recognition accuracy rising from 27.13% to 88.03%. Moreover, when tested via zero-shot evaluation on an in-house real-world high-risk motion dataset, DriveMRP-Agent achieves a significant performance leap, boosting accuracy from the base model's 29.42% to 68.50%, which showcases the strong generalization capabilities of our method in real-world scenarios.
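The abstract's "trajectory projection" injection step amounts to mapping planned ego-frame waypoints into the pixel coordinates of a BEV image so the VLM can relate them to the scene. The paper's actual implementation is not shown here; the sketch below is a minimal illustration under assumed conventions (ego at image center, x forward, y left, a hypothetical `meters_per_px` resolution), and the function name is invented for illustration.

```python
def project_trajectory_to_bev(waypoints_m, bev_size=(400, 400), meters_per_px=0.2):
    """Map ego-frame (x, y) waypoints in meters to BEV pixel coordinates.

    Assumed convention: ego vehicle at image center, x forward (up in the
    image), y left. Waypoints falling outside the raster are dropped.
    """
    h, w = bev_size
    cx, cy = w // 2, h // 2
    pixels = []
    for x, y in waypoints_m:
        px = int(round(cx - y / meters_per_px))  # left in ego frame -> left in image
        py = int(round(cy - x / meters_per_px))  # forward in ego frame -> up in image
        if 0 <= px < w and 0 <= py < h:
            pixels.append((px, py))
    return pixels
```

The resulting pixel list could then be drawn onto the BEV rendering before it is passed to the VLM, which is one plausible way to let the model reason about waypoint-environment spatial relationships.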
Problem

Research questions and friction points this paper is trying to address.

Enhancing motion risk prediction in autonomous driving using synthetic data
Addressing data limitations in long-tail scenarios for safety prediction
Improving Vision-Language Models' spatial reasoning for risk estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

BEV-based motion simulation for risk modeling
Synthetic high-risk motion data DriveMRP-10K
VLM-agnostic framework DriveMRP-Agent
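One common way to realize BEV-based synthesis of high-risk motion data, of the kind the bullets above describe, is to perturb a safe recorded trajectory until it drifts toward a neighboring lane or agent. The paper's actual simulation procedure is not reproduced here; this is a hedged sketch, and the function name, the linear ramp, and the `lateral_shift` parameter are all assumptions made for illustration.

```python
import numpy as np

def synthesize_risky_trajectory(waypoints, lateral_shift=1.5, seed=0):
    """Perturb a safe ego trajectory laterally to create a high-risk variant.

    waypoints: (N, 2) array of ego-frame (x, y) positions in meters.
    The lateral offset ramps up linearly along the horizon so the first
    waypoint stays anchored to the current ego pose.
    """
    rng = np.random.default_rng(seed)
    wp = np.asarray(waypoints, dtype=float).copy()
    ramp = np.linspace(0.0, 1.0, len(wp))
    direction = rng.choice([-1.0, 1.0])            # drift left or right at random
    wp[:, 1] += direction * lateral_shift * ramp   # growing lateral offset
    return wp
```

In a full pipeline, each perturbed trajectory would be rendered into the BEV frame and labeled by checking for overlap with other agents or off-road regions, yielding paired safe/risky samples for VLM fine-tuning.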
👥 Authors
Zhiyi Hou — Westlake University
Enhui Ma — Westlake University
Fang Li — Xiaomi EV
Zhiyi Lai — Xiaomi EV
Kalok Ho — Xiaomi EV
Zhanqian Wu — Xiaomi EV
Lijun Zhou — Xiaomi Corporation
Long Chen — Xiaomi EV
Chitian Sun — Xiaomi EV
Haiyang Sun — Xiaomi EV
Bing Wang — Xiaomi EV
Guang Chen — Xiaomi EV
Hangjun Ye — Xiaomi EV
Kaicheng Yu — Assistant Professor, Westlake University, PI of Autonomous Intelligence Lab (computer vision, 3D understanding, autonomous perception, automatic machine learning)