Action Emergence from Streaming Intent

📅 2026-05-12

🤖 AI Summary

This work addresses the challenge of generating physically feasible, semantically appropriate, and safety-compliant driving actions in long-tail traffic scenarios within end-to-end autonomous driving systems. The authors propose Streaming Intent, a mechanism that derives driving intent from scene understanding through a continuous chain-of-thought and keeps that intent temporally coherent across clips. Their Vision-Language-Action (VLA) model, SI, autoregressively decodes a four-step chain-of-thought and emits an intent token, which then drives classifier-free guidance (CFG) on a flow-matching action head; the final trajectory requires only two denoising steps. On the Waymo End-to-End benchmark, SI attains RFS scores of 7.96 on the validation set and 7.74 on the test set. The authors also report, to their knowledge for the first time in a fully end-to-end VLA, intent-faithful controllability: for a fixed scene, varying the intent class at inference yields distinct yet consistently high-quality plans, without a pre-built trajectory bank or hand-coded post-hoc selector.

📝 Abstract

We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates -- to our knowledge for the first time in a fully end-to-end VLA -- intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.
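To make the sampling step in the abstract concrete, here is a minimal sketch of intent-conditioned classifier-free guidance on a flow-matching sampler with two Euler steps. It is not the authors' implementation: the learned action head is replaced by a hand-written toy velocity field, and every name in it (`TOY_TARGETS`, `toy_velocity`, `cfg_sample`, the three intent labels, the three-waypoint trajectory format) is a hypothetical choice made for illustration. The sketch assumes a straight-line (rectified-flow style) probability path, for which the velocity at time t points from the current sample to the endpoint and is scaled by 1 / (1 - t).

```python
from typing import Callable, List, Optional

Vector = List[float]

# Hypothetical intent-specific endpoints: lateral offsets (metres) at three
# future waypoints. A real action head would learn these from data.
TOY_TARGETS = {
    "left": [0.5, 1.5, 3.0],
    "straight": [0.0, 0.0, 0.0],
    "right": [-0.5, -1.5, -3.0],
}


def toy_velocity(x: Vector, t: float, intent: Optional[str]) -> Vector:
    """Toy stand-in for a learned flow-matching velocity field.

    With intent=None it plays the role of the unconditional branch and
    points toward the mean of all intent endpoints.
    """
    if intent is None:
        n = len(TOY_TARGETS)
        target = [sum(col) / n for col in zip(*TOY_TARGETS.values())]
    else:
        target = TOY_TARGETS[intent]
    # Straight-line path: velocity = (endpoint - current) / (1 - t).
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


def cfg_sample(
    velocity: Callable[[Vector, float, Optional[str]], Vector],
    x0: Vector,
    intent: str,
    guidance_scale: float = 1.0,
    num_steps: int = 2,
) -> Vector:
    """Integrate the guided velocity from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t takes values 0, dt, ..., 1 - dt, so 1 - t is never 0
        v_cond = velocity(x, t, intent)
        v_uncond = velocity(x, t, None)
        # Classifier-free guidance: extrapolate from the unconditional
        # velocity toward the intent-conditioned one.
        v = [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.2, -0.1, 0.4]  # fixed starting sample for reproducibility
    for intent in TOY_TARGETS:
        print(intent, cfg_sample(toy_velocity, noise, intent))
```

Holding the starting sample fixed and changing only `intent` yields a different trajectory per intent, which mirrors, in a very simplified way, the fixed-scene controllability the abstract describes. In this toy field a `guidance_scale` above 1 pushes the result past the intent endpoint and away from the unconditional mean; how guidance strength behaves in the actual model is not something this sketch can show.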

Problem

Research questions and friction points this paper is trying to address.

action emergence

end-to-end autonomous driving

scene-conditioned reasoning

long-tail traffic scenes

driving intent

Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Intent

Action Emergence

Chain-of-Thought Reasoning