🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) systems: they typically rely on direct action prediction and struggle with long-horizon reasoning and consequence evaluation. The authors propose the World-Value-Action (WAV) model, which performs implicit planning in a structured latent space by combining a world model that predicts future states with a trajectory value function that assesses long-horizon utility. Action generation is thereby reformulated as latent-space inference toward high-value, dynamically feasible trajectories. This approach avoids explicit trajectory optimization, and the authors argue theoretically that it mitigates the exponential decay in the probability of feasible trajectories that affects planning directly in action space over long horizons. In simulation and real-world experiments, WAV is reported to consistently outperform state-of-the-art methods, with improvements in task success rate, generalization, and robustness, especially on long-horizon and compositional tasks.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction and lack the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulation and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.
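The exponential-decay claim can be illustrated with a toy calculation. The sketch below is not the paper's analysis; it assumes, purely for illustration, that each step of a sampled action sequence is feasible independently with a fixed probability `p_step`, so that a whole trajectory of length `horizon` is feasible with probability `p_step ** horizon`. The function names and the specific probabilities are hypothetical.

```python
import random


def closed_form(horizon, p_step):
    """Probability that all `horizon` steps are feasible, assuming
    independent steps that are each feasible with probability `p_step`."""
    return p_step ** horizon


def feasible_fraction(horizon, p_step, n_samples=20000, seed=0):
    """Monte Carlo estimate of the same quantity: sample trajectories
    step by step and count those in which every step is feasible."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        if all(rng.random() < p_step for _ in range(horizon)):
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    # Compare a broad sampler (p_step = 0.8) with one whose samples are
    # concentrated near feasible regions (p_step = 0.98, an assumed value).
    for horizon in (1, 5, 10, 20, 40):
        broad = closed_form(horizon, 0.8)
        reshaped = closed_form(horizon, 0.98)
        print(f"H={horizon:2d}  broad={broad:.4f}  reshaped={reshaped:.4f}")
```

Under this simplified independence assumption, raising the per-step feasibility of the sampling distribution slows the decay substantially, which is the intuition behind reshaping the search distribution toward feasible regions rather than sampling actions directly.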