Driving Intents Amplify Planning-Oriented Reinforcement Learning

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of continuous-action driving policies trained on a single demonstrated trajectory per scene: samples collapse around the demonstrated maneuver, which caps best-of-N performance under preference-based evaluation and undermines preference optimization. To overcome this, the authors propose DIAL, a two-stage reinforcement learning framework. In the first stage, a flow-matching action head conditioned on a discrete driving-intent label is sampled with classifier-free guidance (CFG), which broadens the sampling distribution across distinct maneuver modes. The second stage applies multi-intent Group Relative Policy Optimization (GRPO), in which every preference group spans all intent classes, so that preference fine-tuning does not re-collapse around the currently preferred mode. On WOD-E2E, intent-CFG sampling reaches a best-of-128 Rater Feedback Score (RFS) of 9.14, surpassing both the human-driven demonstration (8.13) and the strongest prior method, RAP (8.5 at best-of-64). Multi-intent GRPO improves held-out RFS from 7.681 to 8.211, whereas every single-intent baseline peaks lower and degrades by the end of training.
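To make the first stage concrete, the following is a minimal sketch of classifier-free guidance applied to an intent-conditioned velocity field, integrated with Euler steps as in flow-matching sampling. It is not the authors' implementation: the intent names, the 1-D "lateral offset" targets, `toy_velocity`, and the guidance scale values are all hypothetical stand-ins for a learned action head, chosen so the effect of guidance is easy to see.

```python
import random

# Hypothetical 1-D maneuver targets per intent (e.g. lateral offset in metres).
# The real model outputs full trajectories; these values are illustrative only.
INTENT_TARGETS = {"keep_lane": 0.0, "nudge_left": 1.5, "nudge_right": -1.5}
UNCOND_TARGET = sum(INTENT_TARGETS.values()) / len(INTENT_TARGETS)


def toy_velocity(x, t, intent=None):
    """Stand-in for a learned velocity field.

    Follows a straight-line path toward a target; intent=None plays the role
    of the unconditional (intent-dropped) branch used by CFG.
    """
    target = UNCOND_TARGET if intent is None else INTENT_TARGETS[intent]
    return (target - x) / (1.0 - t)


def cfg_velocity(x, t, intent, guidance_scale):
    """Classifier-free guidance: extrapolate from unconditional to conditional."""
    v_uncond = toy_velocity(x, t, None)
    v_cond = toy_velocity(x, t, intent)
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def sample(intent, guidance_scale, num_steps=20, rng=None):
    """Euler-integrate the guided velocity from noise (t=0) to an action (t=1)."""
    rng = rng or random.Random(0)
    x = rng.gauss(0.0, 1.0)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the division above is safe
        x += dt * cfg_velocity(x, t, intent, guidance_scale)
    return x


plain = sample("nudge_left", guidance_scale=1.0)   # lands on the intent target
guided = sample("nudge_left", guidance_scale=2.0)  # pushed further from the unconditional mode
```

In this toy setting a guidance scale of 1 reproduces the conditional target, while a larger scale moves the sample further away from the unconditional mode, which is the sense in which guidance can spread samples across maneuver modes.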

📝 Abstract

Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance: even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. We instantiate DIAL for end-to-end driving with eight rule-derived intents and evaluate it on WOD-E2E. Competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at a Rater Feedback Score (RFS) of 8.5 even with best-of-64. Intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP, 8.5) and the human-driven demonstration (8.13) for the first time. Multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by the end of training. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but also how to expand and preserve the sampling distribution being optimized.
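The second stage can be sketched in the same spirit. The snippet below assumes the standard GRPO-style advantage (reward minus group mean, divided by group standard deviation) and builds each preference group from samples of every intent, which is how the abstract describes multi-intent GRPO at a high level. The intent labels, `toy_sample`, and `toy_score` are hypothetical; the abstract mentions eight rule-derived intents but does not list them, and the authors' exact objective may differ.

```python
import statistics

# Hypothetical intent labels; the source does not name its eight intents.
INTENTS = ["keep_lane", "nudge_left", "nudge_right", "slow_down"]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalisation: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def multi_intent_group(intents, samples_per_intent, sample_fn, score_fn):
    """Build one preference group that spans every intent class.

    Returns (intent, trajectory, reward, advantage) tuples. Because the
    baseline is computed over all intents, a sample is compared against
    alternatives from other maneuver modes, not only its own.
    """
    group = []
    for intent in intents:
        for _ in range(samples_per_intent):
            traj = sample_fn(intent)
            group.append((intent, traj, score_fn(traj)))
    advantages = group_relative_advantages([reward for _, _, reward in group])
    return [(i, t, r, a) for (i, t, r), a in zip(group, advantages)]


def best_of_n(scores):
    """Oracle selection: the best score present among the N samples."""
    return max(scores)


# Toy stand-ins for the policy sampler and the rater score (assumptions).
def toy_sample(intent):
    return float(INTENTS.index(intent))


def toy_score(traj):
    return 10.0 - abs(traj - 1.0)  # pretends "nudge_left" is the preferred maneuver


group = multi_intent_group(INTENTS, 2, toy_sample, toy_score)
ceiling = best_of_n([reward for _, _, reward, _ in group])
```

The `best_of_n` helper illustrates the ceiling argument: oracle selection can only return a score that some sample in the group already achieves, so a group drawn from a single collapsed mode cannot exceed that mode's best score.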

Problem

Research questions and friction points this paper is trying to address.

mode collapse

driving intents

preference-based evaluation

continuous-action policies

sampling distribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Driving Intent

Classifier-Free Guidance

Preference Reinforcement Learning