Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three limitations of existing diffusion-based planners in closed-loop cooperative driving: weak scene consistency, poor alignment with closed-loop objectives, and unstable online fine-tuning in reactive multi-agent environments. The authors propose Multi-ORFT, a framework that couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. During pre-training, inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning improve the scene consistency and road adherence of the predicted joint trajectories. During post-training, a two-level Markov decision process (MDP) exposes step-wise reverse-kernel likelihoods for online optimization, and dense trajectory-level rewards are combined with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, the method achieves a collision rate of 1.89%, an off-road rate of 1.36%, and an average speed of 8.61 m/s, improving on the pre-trained planner (2.04%, 1.68%, and 8.36 m/s) and outperforming open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics.
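To make the idea of variance gating concrete, here is a minimal sketch of group-relative advantages with a gate on reward variance. The summary does not state the actual gating rule, so everything here is an assumption for illustration: the function name `vg_grpo_advantages`, the `var_threshold` parameter, and the choice to zero out advantages for low-variance groups are hypothetical and may differ from the paper's VG-GRPO.

```python
from statistics import mean, pstdev


def vg_grpo_advantages(group_rewards, var_threshold=1e-4, eps=1e-8):
    """Group-relative advantages with a hypothetical variance gate.

    group_rewards: rewards of rollouts sampled for the same scene (one group).
    var_threshold: assumed gate; a group whose reward variance falls below
                   this value contributes zero advantage (no gradient signal).
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)  # population standard deviation of the group

    # Assumed gating rule: skip groups with near-identical rewards, where
    # dividing by a tiny standard deviation would amplify noise.
    if sigma ** 2 < var_threshold:
        return [0.0 for _ in group_rewards]

    # Standard group-relative normalization: center and scale within the group.
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

In this sketch, a group such as `[1.0, 1.0, 1.0]` is gated out entirely, while a group with spread-out rewards receives zero-mean, unit-scale advantages.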

📝 Abstract

Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.
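The before/after figures in the abstract can be turned into relative changes with a few lines of arithmetic. The snippet below uses only the numbers reported above; the helper name `relative_change` and the dictionary keys are illustrative and not taken from the paper.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before * 100.0


# (pre-trained planner, Multi-ORFT) values as reported in the abstract.
metrics = {
    "collision_rate_pct": (2.04, 1.89),
    "off_road_rate_pct": (1.68, 1.36),
    "avg_speed_mps": (8.36, 8.61),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change(before, after):+.1f}%")
# collision_rate_pct: -7.4%
# off_road_rate_pct: -19.0%
# avg_speed_mps: +3.0%
```

Read this way, post-training yields roughly a 7% relative reduction in collisions, a 19% relative reduction in off-road events, and a 3% increase in average speed over the pre-trained planner.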

Problem

Research questions and friction points this paper is trying to address.

cooperative driving

diffusion planning

online reinforcement fine-tuning

multi-agent trajectory

scene consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion planning

online reinforcement fine-tuning

multi-agent cooperation