DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

📅 2026-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing end-to-end vision-language-action (VLA) models treat vision-language models (VLMs) merely as multimodal encoders that map directly to low-level actions, neglecting their high-level reasoning capabilities and causing training instability and semantic degradation. To overcome this, the authors propose a differentiable implicit intention bottleneck that decouples high-level decision-making from low-level execution: a System-2 module leverages the VLM for implicit world modeling to generate visual look-ahead intentions, while a System-1 module applies implicit inverse dynamics to decode these intentions, together with current observations, into precise motor actions. A two-stage training strategy, a decoupled warm-up followed by joint optimization, enables stable end-to-end learning. The approach achieves the first intention–action disentanglement within the VLM's native feature space, sets a new state of the art on the RoboCasa GR1 Tabletop benchmark with less than 10% of the demonstration data used by prior methods, and demonstrates strong zero-shot generalization on a real humanoid robot.
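
The decoupled System-2/System-1 design lends itself to a short sketch. The PyTorch snippet below is a minimal, illustrative rendering of the intent–action bottleneck: a System-2 world model predicts a latent "visual foresight" intent, and a System-1 policy decodes it, together with the current observation, into an action via latent inverse dynamics. The module names, feature dimensions, seven-dimensional action space, and the stand-in MLP used in place of a pre-trained VLM backbone are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal architectural sketch of the intent-action bottleneck described above.
# Module names, dimensions, and the stand-in VLM encoder are illustrative assumptions.
import torch
import torch.nn as nn


class System2WorldModel(nn.Module):
    """VLM-based System-2: predicts a latent 'visual foresight' (intent) from
    current vision-language features. The VLM backbone is a placeholder MLP here."""

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.vlm_backbone = nn.Sequential(  # stand-in for a pre-trained VLM
            nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, feat_dim)
        )
        self.foresight_head = nn.Linear(feat_dim, feat_dim)  # latent future in VLM feature space

    def forward(self, obs_lang_feat: torch.Tensor) -> torch.Tensor:
        h = self.vlm_backbone(obs_lang_feat)
        return self.foresight_head(h)  # latent intent: predicted future features


class System1Policy(nn.Module):
    """Lightweight System-1: latent inverse dynamics mapping
    (current observation, predicted future intent) -> low-level action."""

    def __init__(self, feat_dim: int = 1024, action_dim: int = 7):
        super().__init__()
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.GELU(), nn.Linear(512, action_dim)
        )

    def forward(self, obs_feat: torch.Tensor, intent: torch.Tensor) -> torch.Tensor:
        return self.inverse_dynamics(torch.cat([obs_feat, intent], dim=-1))


if __name__ == "__main__":
    system2, system1 = System2WorldModel(), System1Policy()
    obs = torch.randn(4, 1024)      # batch of current observation features
    intent = system2(obs)           # System-2: latent visual foresight
    action = system1(obs, intent)   # System-1: decode intent into a motor action
    print(action.shape)             # torch.Size([4, 7])
```

Because the intent lives in the VLM's own feature space and both modules are differentiable, action gradients can flow back through the bottleneck during joint training, which is the mechanism the summary attributes to DIAL's stability.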
📝 Abstract
The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.
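
The two-stage schedule in the abstract can likewise be sketched. The snippet below reuses the System2WorldModel and System1Policy classes from the earlier sketch: one warm-up step trains System-2 to regress toward ground-truth future features while System-1 learns motor control under ground-truth future guidance, and one joint step lets action gradients flow through the predicted intent back into System-2. The MSE losses, the loss weight `lam`, and the optimizer setup are assumptions for illustration; the paper's actual objectives may differ.

```python
# Hedged sketch of the two-stage training schedule described in the abstract.
# Reuses System2WorldModel / System1Policy from the previous sketch; losses,
# loss weighting, and optimizers are illustrative assumptions.
import torch.nn.functional as F


def warmup_step(system2, system1, obs_feat, future_feat_gt, action_gt, opt2, opt1):
    """Stage 1: decoupled warm-up.
    System-2 learns to predict ground-truth future features (latent world modeling);
    System-1 learns inverse dynamics under ground-truth future guidance."""
    # System-2: predict the latent future from current observation features
    foresight = system2(obs_feat)
    loss_s2 = F.mse_loss(foresight, future_feat_gt)
    opt2.zero_grad()
    loss_s2.backward()
    opt2.step()

    # System-1: decode (current obs, ground-truth future) into actions
    pred_action = system1(obs_feat, future_feat_gt)
    loss_s1 = F.mse_loss(pred_action, action_gt)
    opt1.zero_grad()
    loss_s1.backward()
    opt1.step()
    return loss_s2.item(), loss_s1.item()


def joint_step(system2, system1, obs_feat, future_feat_gt, action_gt, opt, lam=1.0):
    """Stage 2: end-to-end joint optimization.
    Action gradients flow back through the *predicted* intent into System-2,
    while the foresight loss keeps the latent bottleneck anchored to future features."""
    foresight = system2(obs_feat)
    pred_action = system1(obs_feat, foresight)
    loss = F.mse_loss(pred_action, action_gt) + lam * F.mse_loss(foresight, future_feat_gt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In this reading, the warm-up phase gives both modules a shared feature space before any action gradients touch the VLM backbone, so the joint phase can refine it without erasing pre-trained semantics.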
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
end-to-end VLA
training instability
semantic representation degradation
high-level decision making
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent intent bottleneck
decoupled VLA architecture
latent world modeling
two-stage training
vision-language-action