Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current vision-language-action (VLA) models, which rely heavily on statistical priors and lack deep understanding of physical dynamics, thereby hindering autonomous improvement. To overcome this, the authors propose a sparse world imagination mechanism that enables online optimization of action sequences by modeling predictive task progress and trajectory trends. The approach incorporates an intrinsic prediction–based self-correcting control policy and employs auxiliary prediction heads to model short-term physical evolution. By combining sparse future state predictions with dense reward reconstruction, the method dynamically adjusts its policy during execution. Evaluated on both simulated and real-world robotic manipulation tasks, the proposed framework achieves state-of-the-art performance, improving task success rates by 9%, reducing execution steps by 16%, and enhancing real-world performance by 14%.

📝 Abstract
Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration, yet it typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling and lack explicit mechanisms for self-improvement. To address these problems, we propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. We then introduce an online action refinement module that reshapes progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at https://github.com/Kisaragi0/SC-VLA.
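To make the refinement loop concrete, the sketch below illustrates the general idea of scoring a policy's proposed action chunk with predicted task progress and trajectory trend, then perturbing and re-scoring candidates online. This is a minimal toy in pure Python, not the paper's implementation: the `score` function (progress-toward-goal plus heading alignment), the 2-D state, and all parameter names are hypothetical stand-ins for SC-VLA's learned auxiliary heads and progress-dependent dense rewards.

```python
import math
import random

random.seed(0)
GOAL = (1.0, 1.0)  # hypothetical 2-D goal used only for this sketch

def score(state, actions, alpha=1.0, beta=0.5):
    """Dense score: a progress proxy plus a trajectory-trend proxy.
    Both terms are toy stand-ins for learned auxiliary prediction heads."""
    # Imagined final state after applying the whole action chunk
    # (a sparse future-state prediction in the paper's terms).
    x = state[0] + sum(a[0] for a in actions)
    y = state[1] + sum(a[1] for a in actions)
    progress = -math.hypot(x - GOAL[0], y - GOAL[1])  # closer to goal = more progress
    # Trend: cosine between the chunk's mean heading and the goal direction.
    hx = sum(a[0] for a in actions) / len(actions)
    hy = sum(a[1] for a in actions) / len(actions)
    gx, gy = GOAL[0] - state[0], GOAL[1] - state[1]
    denom = math.hypot(hx, hy) * math.hypot(gx, gy) + 1e-8
    trend = (hx * gx + hy * gy) / denom
    return alpha * progress + beta * trend

def refine(state, base_actions, n_samples=64, noise=0.1):
    """Online refinement: perturb the policy's proposed action chunk and keep
    the candidate with the best predicted score (base chunk included)."""
    best, best_score = base_actions, score(state, base_actions)
    for _ in range(n_samples):
        cand = [(a[0] + random.gauss(0, noise), a[1] + random.gauss(0, noise))
                for a in base_actions]
        s = score(state, cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

state = (0.0, 0.0)
base = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(5)]  # 5-step chunk
refined, refined_score = refine(state, base)
```

Because the base chunk is itself a candidate, the refined chunk's predicted score is never worse than the policy's original proposal; the real system would replace the hand-written `score` with the model's intrinsic predictions.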
Problem

Research questions and friction points this paper is trying to address.

vision-language-action
physical dynamics
self-improvement
world model
reward signal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Correcting VLA
Sparse World Imagination
Online Action Refinement
Predictive Planning
Intrinsic Reward Shaping