From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA models

📅 2026-02-02
🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models in robotic grasping tasks, which are prone to action-space biases leading to grasp failures and suffer from inaccurate task-completion judgments that cause redundant actions or timeout errors. To overcome these issues, the authors propose VLA-SCT, a novel framework that introduces, for the first time, a lightweight, training-free, general-purpose self-correction and termination mechanism. By integrating data-driven action refinement with a conditional logical termination strategy, VLA-SCT establishes a closed-loop control system that significantly enhances execution accuracy, task-completion assessment, and overall robustness of VLA models in complex environments. The method achieves consistent performance gains across all datasets in the LIBERO benchmark, with particularly substantial improvements in success rates on fine-grained manipulation tasks.

📝 Abstract
Vision-language-action (VLA) models for embodied agents integrate perception, reasoning, and control, but they remain constrained by two critical weaknesses. First, during grasping tasks, the action tokens generated by the language model often exhibit subtle spatial deviations from the target object, resulting in grasp failures. Second, they cannot reliably recognize task completion, which leads to redundant actions and frequent timeout errors. To address these challenges and enhance robustness, we propose VLA-SCT, a lightweight, training-free framework. It operates as a self-correcting control loop, combining data-driven action refinement with conditional logic for termination. Compared with baseline approaches, our method achieves consistent improvements across all datasets in the LIBERO benchmark, significantly increasing the success rate of fine manipulation tasks and ensuring accurate task completion, thereby promoting the deployment of more reliable VLA agents in complex, unstructured environments.
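The abstract describes the control loop only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes the refinement can be stood in for by a fixed per-dimension offset and the termination logic by a simple predicate on the observation. All names (`refine_action`, `run_episode`, `make_toy_env`) and the 1-D toy environment are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Action = List[float]  # assumed layout; a real VLA action has more dimensions


def refine_action(action: Sequence[float], offset: Sequence[float]) -> Action:
    """Add a per-dimension correction to a raw policy action.

    `offset` is a placeholder for a data-driven correction (for example,
    a mean deviation estimated from logged rollouts). How the paper
    computes its refinement is not stated in the abstract.
    """
    return [a + o for a, o in zip(action, offset)]


def make_toy_env(target: float, tol: float = 0.01):
    """Hypothetical 1-D environment: the state is a position moved by action[0]."""
    state = {"x": 0.0}

    def step(action: Sequence[float]) -> float:
        state["x"] += action[0]
        return state["x"]

    def is_done(obs: float) -> bool:
        # Conditional termination: stop once the position is within tolerance.
        return abs(obs - target) < tol

    return step, is_done


def run_episode(
    policy: Callable[[Optional[float]], Action],
    step: Callable[[Sequence[float]], float],
    is_done: Callable[[float], bool],
    offset: Sequence[float],
    max_steps: int = 50,
) -> Tuple[int, bool]:
    """Closed loop: predict, refine, act, then check for completion.

    Returns (steps_used, terminated_early). Reaching `max_steps` without
    `is_done` firing models a timeout.
    """
    obs: Optional[float] = None
    for t in range(max_steps):
        raw = policy(obs)
        obs = step(refine_action(raw, offset))
        if is_done(obs):
            return t + 1, True
    return max_steps, False


if __name__ == "__main__":
    # A policy that consistently undershoots the ideal step of 0.25.
    biased_policy = lambda obs: [0.15]

    step, is_done = make_toy_env(target=1.0)
    print(run_episode(biased_policy, step, is_done, offset=[0.1], max_steps=20))

    step, is_done = make_toy_env(target=1.0)
    print(run_episode(biased_policy, step, is_done, offset=[0.0], max_steps=20))
```

In this toy setting the corrected run reaches the target and stops early, while the uncorrected run steps past the target without ever satisfying the termination check and exhausts its step budget, which mirrors the two failure modes named in the abstract.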
Problem

Research questions and friction points this paper is trying to address.

vision-language-action
grasping
task completion
action deviation
embodied agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language-action
self-correction
task termination
embodied agents
training-free framework
Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, HTSC, time-resolved
Aolan Sun
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China; Corresponding author
Wentao Mo
Tsinghua University
Trustworthy Artificial Intelligence, Multimodal Learning
Xiaoyang Qu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Yuxin Zheng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Jianzong Wang
Postdoctoral Researcher of Department of Electrical and Computer Engineering, University of Florida
Big Data, Storage System, Cloud Computing