🤖 AI Summary
This work addresses two limitations of existing vision-language-action (VLA) models in robotic grasping: they are prone to action-space biases that lead to grasp failures, and they suffer from inaccurate task-completion judgments that cause redundant actions or timeout errors. To overcome these issues, the authors propose VLA-SCT, a novel framework that introduces, for the first time, a lightweight, training-free, general-purpose self-correction and termination mechanism. By integrating data-driven action refinement with a conditional logical termination strategy, VLA-SCT establishes a closed-loop control system that significantly enhances execution accuracy, task-completion assessment, and the overall robustness of VLA models in complex environments. The method achieves consistent performance gains across all datasets in the LIBERO benchmark, with particularly substantial improvements in success rates on fine-grained manipulation tasks.
📝 Abstract
While vision-language-action (VLA) models for embodied agents integrate perception, reasoning, and control, they remain constrained by two critical weaknesses: first, during grasping tasks, the action tokens generated by the language model often exhibit subtle spatial deviations from the target object, resulting in grasp failures; second, they lack the ability to reliably recognize task completion, which leads to redundant actions and frequent timeout errors. To address these challenges and enhance robustness, we propose VLA-SCT, a lightweight, training-free framework that operates as a self-correcting control loop, combining data-driven action refinement with conditional termination logic. Compared to baseline approaches, our method achieves consistent improvements across all datasets in the LIBERO benchmark, significantly increasing the success rate on fine-grained manipulation tasks and ensuring accurate task completion. These results promote the deployment of more reliable VLA agents in complex, unstructured environments.
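The self-correcting control loop described above can be sketched roughly as follows. This is a minimal illustration of the general idea (refine each predicted action toward the detected target, then terminate only once a success condition has held for several consecutive steps); all names (`vla_policy`, `refine_action`, `task_completed`, the observation keys, the 7-dimensional action layout) are hypothetical placeholders, not the paper's actual API.

```python
import numpy as np

def refine_action(action, target_pos, gripper_pos, gain=0.5):
    """Data-driven refinement (illustrative): nudge the predicted
    end-effector motion toward the detected target to compensate for
    the model's spatial bias."""
    correction = gain * (target_pos - gripper_pos)
    refined = action.copy()
    refined[:3] += correction  # assume first 3 dims are Cartesian deltas
    return refined

def task_completed(goal_reached_steps, patience=3):
    """Conditional termination (illustrative): declare the task done
    only after the success condition has held for `patience`
    consecutive steps, avoiding premature or missed stops."""
    return goal_reached_steps >= patience

def control_loop(env, vla_policy, max_steps=200):
    """Closed-loop execution: predict, correct, act, and check for
    termination each step instead of running until timeout."""
    obs = env.reset()
    goal_reached_steps = 0
    for step in range(max_steps):
        action = vla_policy(obs)                    # raw VLA action
        action = refine_action(action,
                               obs["target_pos"],
                               obs["gripper_pos"])  # self-correction
        obs, success = env.step(action)
        goal_reached_steps = goal_reached_steps + 1 if success else 0
        if task_completed(goal_reached_steps):      # early, accurate stop
            return True, step + 1
    return False, max_steps
```

Even with a maximally biased policy (one that predicts zero motion), the correction term alone drives the gripper to the target, and the patience check prevents the loop from running to the step limit once the grasp condition is stable.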