Action Emergence from Streaming Intent

📅 2026-05-12
🤖 AI Summary
This work addresses the challenge of generating physically feasible, semantically appropriate, and safety-compliant driving actions in long-tail traffic scenarios within end-to-end autonomous driving systems. The authors propose Streaming Intent, a mechanism that derives driving intent from scene understanding through a continuous chain-of-thought and keeps intent commitments coherent across clips along the driving horizon. Their Vision-Language-Action (VLA) model, SI, autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, which needs only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI attains RFS scores of 7.96 on the validation set and 7.74 on the test set. The authors also report intent-faithful controllability, to their knowledge for the first time in a fully end-to-end VLA: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, learned purely from data without a pre-built trajectory bank or hand-coded post-hoc selector.
📝 Abstract
We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates -- to our knowledge for the first time in a fully end-to-end VLA -- intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.
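To make the sampling step in the abstract concrete, the following is a minimal sketch of classifier-free guidance applied to a flow-matching sampler with two Euler steps. It is not SI's implementation. The velocity function `toy_velocity`, the intent names, the target trajectories, and the `guidance_scale` parameter are all hypothetical stand-ins chosen for illustration; in the real model the velocity would come from a learned network conditioned on the decoded intent token.

```python
import numpy as np

# Hypothetical intent classes with toy target trajectories
# (4 waypoints, x/y in metres). Not taken from the paper.
TARGETS = {
    "go_straight": np.array([[0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [0.0, 8.0]]),
    "turn_left": np.array([[-0.5, 2.0], [-1.5, 3.5], [-3.0, 4.5], [-5.0, 5.0]]),
}
# Toy unconditional target: the average over intents.
UNCOND_TARGET = np.mean(list(TARGETS.values()), axis=0)


def toy_velocity(x, t, intent=None):
    """Stand-in for a learned velocity network v(x, t | intent).

    For a straight-line flow-matching path that ends at `target`,
    the velocity at (x, t) is (target - x) / (1 - t).
    intent=None plays the role of the unconditional branch.
    """
    target = UNCOND_TARGET if intent is None else TARGETS[intent]
    return (target - x) / (1.0 - t)


def sample_trajectory(intent, guidance_scale=1.0, num_steps=2, seed=0):
    """Integrate the guided velocity field from noise (t=0) to t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(UNCOND_TARGET.shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v_cond = toy_velocity(x, t, intent)
        v_uncond = toy_velocity(x, t, None)
        # Classifier-free guidance: push the unconditional velocity
        # toward the intent-conditioned one.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x + dt * v  # Euler step
    return x


if __name__ == "__main__":
    for name in TARGETS:
        print(name)
        print(np.round(sample_trajectory(name), 3))
```

With the same seed (a fixed "scene" in this toy setting), changing only the intent argument yields a different trajectory, which mirrors the controllability property described above. A guidance scale of 1 reduces to purely conditional sampling; larger values extrapolate away from the unconditional output.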
Problem

Research questions and friction points this paper is trying to address.

action emergence
end-to-end autonomous driving
scene-conditioned reasoning
long-tail traffic scenes
driving intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Intent
Action Emergence
Chain-of-Thought Reasoning
Flow-Matching
End-to-End Autonomous Driving