🤖 AI Summary
Existing vision-language-action models lack explicit task progress awareness and rely on heuristic rules for termination, limiting their performance in long-horizon, multi-subtask scenarios. This work proposes a zero-shot generalizable approach to task progress estimation, leveraging large-scale unsupervised video-text pretraining to build a robust progress estimator. By integrating a differentiable inverse dynamics world model, the method links predicted actions to future latent visual states and introduces a maximal-progress regularization, enabling end-to-end differentiable action optimization. Evaluated on the CALVIN and LIBERO benchmarks as well as real robotic platforms, the approach substantially improves task success rates and generalization, achieving a low progress prediction residual of 0.07 in simulation.
📝 Abstract
Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named \textbf{\vla}. Our technical contributions are twofold: (1) \emph{robust progress estimation}: We pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of $[0, 1]$) in simulation and demonstrates zero-shot generalization to unseen real-world samples, and (2) \emph{differentiable progress guidance}: We introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.
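The differentiable refinement loop described above can be sketched in miniature: a (frozen) world model maps an action token to a future latent state, a (frozen) progress head scores that latent, and the action is updated by gradient ascent on predicted progress. Everything below is an illustrative stand-in, not the paper's architecture: the world model is a toy linear map, the progress head a sigmoid over a linear readout, and the gradient is computed analytically rather than via autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
D_A, D_Z = 8, 16                  # hypothetical action-token / latent-state dims

W = rng.normal(size=(D_Z, D_A))   # toy "inverse dynamics world model": action -> future latent
w = rng.normal(size=D_Z)          # toy progress head: latent -> progress in (0, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def progress(a):
    """Predicted task progress of the future latent reached by action token a."""
    return sigmoid(w @ (W @ a))

# Maximal-progress regularization, caricatured as plain gradient ascent on
# predicted progress (a real model would backpropagate through the frozen
# estimator and world model to refine the policy's action tokens).
a = rng.normal(size=D_A)          # initial action token from the policy
p0 = progress(a)
lr = 0.1
for _ in range(50):
    p = progress(a)
    grad = p * (1.0 - p) * (W.T @ w)   # analytic d(progress)/d(action)
    a = a + lr * grad

print(progress(a) > p0)           # refinement increases predicted progress
```

Because the toy objective is a sigmoid of a linear function of `a`, each ascent step moves `a` along a fixed direction and predicted progress increases monotonically; in the actual method this role is played by backpropagation through learned, nonlinear modules.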