Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

📅 2026-04-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses key limitations of existing diffusion-based planners in closed-loop cooperative driving: weak scene consistency, poor alignment with closed-loop objectives, and instability during online fine-tuning. The authors propose Multi-ORFT, a framework that couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. During pre-training, inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning improve the consistency of predicted joint trajectories with the surrounding scene. In post-training, a two-level Markov decision process (MDP) exposes step-wise reverse-kernel likelihoods for online optimization, and dense trajectory-level rewards are combined with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, the method achieves a collision rate of 1.89%, an off-road rate of 1.36%, and an average speed of 8.61 m/s, improving on the pre-trained planner (2.04%, 1.68%, and 8.36 m/s) and outperforming open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics.
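The summary does not spell out how the variance gate in VG-GRPO works. As a rough illustration only, the sketch below computes standard group-relative (GRPO-style) advantages for one group of rollout rewards and, as an assumption of this sketch, zeroes them when the group's reward spread is too small to normalize reliably. The function name, the `std_gate` threshold, and the gating rule itself are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev


def gated_group_advantages(rewards, std_gate=1e-3, eps=1e-8):
    """Group-relative advantages for one group of rollouts.

    ASSUMPTION: the gate below (skip groups whose reward standard
    deviation is under `std_gate`) is a hypothetical reading of
    "variance-gated"; the paper's actual rule may differ.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group

    # Hypothetical gate: a near-constant group carries almost no
    # ranking signal, and dividing by a tiny std would amplify noise.
    if std < std_gate:
        return [0.0] * len(rewards)

    # Standard GRPO-style normalization: center and scale within the group.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    print(gated_group_advantages([0.0, 1.0, 2.0]))       # informative group
    print(gated_group_advantages([1.0, 1.0, 1.0, 1.0]))  # gated group
```

In this sketch a gated group contributes zero advantage and therefore no policy gradient, which is one simple way to avoid unstable updates from low-variance groups.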

📝 Abstract
Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.
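The before/after figures in the abstract can be turned into relative changes with simple arithmetic. The short sketch below does this; the helper name `relative_change` is illustrative, and the percentages are derived here rather than reported by the authors.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (pre-trained planner, Multi-ORFT) values quoted in the abstract
metrics = {
    "collision rate (%)": (2.04, 1.89),
    "off-road rate (%)": (1.68, 1.36),
    "average speed (m/s)": (8.36, 8.61),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change(before, after):+.1f}%")
# collision rate (%): -7.4%
# off-road rate (%): -19.0%
# average speed (m/s): +3.0%
```

Read this way, the largest relative gain from post-training is in the off-road rate, with smaller improvements in collision rate and average speed.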
Problem

Research questions and friction points this paper is trying to address.

cooperative driving
diffusion planning
online reinforcement fine-tuning
multi-agent trajectory
scene consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion planning
online reinforcement fine-tuning
multi-agent cooperation
scene consistency
variance-gated policy optimization
Haojie Bai
School of Information Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518071, China

Aimin Li
Ph.D. candidate, Harbin Institute of Technology (Shenzhen), China
Information theory, goal-oriented communications, Age of Information

Ruoyu Yao
Robotics and Autonomous Systems Thrust, Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China

Xiongwei Zhao
Ph.D. candidate, Harbin Institute of Technology
3D Perception, World Model, LLM, Embodied AI, Autonomous System

Tingting Zhang
School of Information Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518071, China

Xing Zhang
School of Computer Science and Technology, Qinghai University, Xining, 810016, China

Lin Gao
University of Electronic Science and Technology of China
Information Fusion

Jun Ma
Assistant Professor, The Hong Kong University of Science and Technology
Robotics, Autonomous Driving, Motion Planning and Control, Optimization