World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing vision-language-action (VLA) systems: they typically rely on direct action prediction and struggle with long-horizon reasoning and consequence evaluation. The authors propose the World-Value-Action (WAV) model, which performs implicit planning in a structured latent space by combining a world model that predicts future states with a trajectory value function that assesses long-term utility. Action generation is thereby reformulated as latent-space inference toward high-value, dynamically feasible trajectories. This approach avoids explicit trajectory optimization and, according to the authors' theoretical analysis, mitigates the exponential decay in the probability of feasible trajectories over long horizons. Experiments in simulation and real-world settings show that WAV outperforms current methods in task success rate, generalization, and robustness, with the largest gains on long-horizon and compositional tasks.
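
To give a feel for the exponential decay mentioned above, here is a minimal arithmetic sketch. It is not the paper's derivation: it assumes, purely for illustration, that each step of a trajectory sampled directly in action space is feasible independently with a fixed probability `p_step`, so that a trajectory of length `horizon` is feasible with probability `p_step ** horizon`. The function name and the value 0.9 are hypothetical.

```python
def feasible_probability(p_step: float, horizon: int) -> float:
    """Probability that every step of a sampled trajectory is feasible.

    Illustrative assumption (not from the paper): steps are feasible
    independently, each with probability p_step.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must lie in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return p_step ** horizon


if __name__ == "__main__":
    # Even a 90% per-step feasibility rate collapses quickly with horizon.
    for horizon in (1, 10, 50, 100):
        print(horizon, feasible_probability(0.9, horizon))
```

Under this simplified assumption, naive sampling in action space would need a number of samples that grows exponentially with the horizon to find a feasible trajectory, which is the kind of difficulty that reshaping the search distribution is meant to avoid.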

📝 Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.
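
The abstract does not specify the inference procedure, so the following is only a toy sketch of the general idea of progressively concentrating probability mass on high-value latents. It swaps in a cross-entropy-method style loop as a stand-in, not the WAV algorithm. Everything here is hypothetical: the scalar latent, `toy_world_model`, `toy_value`, the goal value, and all hyperparameters are assumptions made for illustration.

```python
import random
import statistics


def toy_world_model(z: float, horizon: int = 5) -> list[float]:
    """Hypothetical stand-in for a learned world model.

    Decodes a scalar latent z into a sequence of predicted states that
    move linearly from z / horizon to z.
    """
    return [z * (t + 1) / horizon for t in range(horizon)]


def toy_value(states: list[float], goal: float = 2.0) -> float:
    """Hypothetical trajectory value: highest when the final state hits the goal."""
    return -(states[-1] - goal) ** 2


def refine_latent(iterations: int = 10, samples: int = 200,
                  elite_frac: float = 0.1, seed: int = 0) -> tuple[float, float]:
    """Cross-entropy-method style refinement of a Gaussian over latents.

    Each round samples latents, rolls them out with the toy world model,
    scores the rollouts with the toy value function, and refits the
    Gaussian to the best-scoring latents.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 3.0  # broad initial search distribution
    n_elite = max(2, int(samples * elite_frac))
    for _ in range(iterations):
        latents = [rng.gauss(mean, std) for _ in range(samples)]
        ranked = sorted(latents,
                        key=lambda z: toy_value(toy_world_model(z)),
                        reverse=True)
        elites = ranked[:n_elite]
        mean = statistics.fmean(elites)
        std = statistics.stdev(elites) + 1e-6  # keep std strictly positive
    return mean, std


if __name__ == "__main__":
    final_mean, final_std = refine_latent()
    print(f"latent mean={final_mean:.3f}, std={final_std:.5f}")
```

In this toy setting the search distribution narrows around the latent whose predicted final state matches the goal. A real system would use learned, high-dimensional latents conditioned on images and language, and its inference procedure may differ substantially from this loop.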

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

long-horizon planning

trajectory reasoning

embodied agents

action prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

implicit planning

vision-language-action

latent trajectory inference