Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited reasoning capabilities and the absence of execution monitoring and self-correction mechanisms in existing Vision-Language-Action (VLA) models for embodied manipulation. The authors propose Sentinel-VLA, a metacognitive VLA model with an active "sentinel" module that monitors execution status in real time. The model uses a dynamic reasoning triggering strategy: it invokes deliberative reasoning or error recovery only when necessary, such as during initial planning or when an error is detected, which keeps decision-making robust while limiting computational overhead. The framework also includes a Self-Evolving Continual Learning (SECL) algorithm, which lets the model identify its capability boundaries and automatically collect data to expand them, paired with an Orthogonal Continual Adapter (OC-Adapter) that constrains parameter updates to an orthogonal space to prevent catastrophic forgetting. In real-world experiments, Sentinel-VLA improves the task success rate by over 30% compared to the state-of-the-art PI0 model. The authors state that they will open-source the code, model weights, and automated data generation pipeline.
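The on-demand triggering idea can be illustrated with a small control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Status`, `run_episode`, `monitor`, `reason`, and `act` are hypothetical, and the toy callables stand in for learned model components that the summary does not describe in detail.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Iterable, List, Optional


class Status(Enum):
    """Hypothetical status labels a monitor could emit."""
    NEEDS_PLAN = auto()  # no plan exists yet (start of the episode)
    NOMINAL = auto()     # execution is proceeding as expected
    ERROR = auto()       # the monitor flagged a failure


@dataclass
class EpisodeLog:
    reasoning_calls: int = 0
    actions: List[str] = field(default_factory=list)


def run_episode(
    observations: Iterable[str],
    monitor: Callable[[str, str], Status],
    reason: Callable[[str, Status], str],
    act: Callable[[str, str], str],
) -> EpisodeLog:
    """Act at every step, but call the (expensive) reasoner only on demand."""
    log = EpisodeLog()
    plan: Optional[str] = None
    for obs in observations:
        # Without a plan we must reason; otherwise ask the monitor.
        status = Status.NEEDS_PLAN if plan is None else monitor(obs, plan)
        if status is not Status.NOMINAL:
            plan = reason(obs, status)  # initial planning or error recovery
            log.reasoning_calls += 1
        log.actions.append(act(obs, plan))
    return log


# Toy stand-ins (assumptions for illustration only).
toy_monitor = lambda obs, plan: Status.ERROR if obs == "slip" else Status.NOMINAL
toy_reason = lambda obs, status: "recover" if status is Status.ERROR else "plan"
toy_act = lambda obs, plan: f"{plan}:{obs}"

log = run_episode(["ok", "ok", "slip", "ok"], toy_monitor, toy_reason, toy_act)
print(log.reasoning_calls)  # 2: once for initial planning, once for the slip
print(log.actions)
```

In this toy run the reasoner is called on two of four steps, which is the sense in which on-demand triggering reduces overhead relative to reasoning at every step.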

📝 Abstract

Vision-language-action (VLA) models have advanced the field of embodied manipulation by harnessing broad world knowledge and strong generalization. However, current VLA models still face several key challenges, including limited reasoning capability, lack of status monitoring, and difficulty in self-correction. In this paper, we introduce Sentinel-VLA, a metacognitive VLA model equipped with an active "sentinel" module to monitor real-time execution status. Only when necessary, such as during initial planning or upon detecting an error, does the model trigger dynamic reasoning or formulate error recovery solutions. This on-demand reasoning mechanism ensures robust decision-making while minimizing computational overhead. Notably, all training data (spanning 44 tasks and over 2.6 million transitions) is automatically generated and annotated through our designed pipeline. We also propose the Self-Evolving Continual Learning (SECL) algorithm, which allows Sentinel-VLA to identify its capability boundaries and automatically collect data for expansion, paired with the Orthogonal Continual Adapter (OC-Adapter), which constrains parameter updates to an orthogonal space and thereby prevents catastrophic forgetting. Real-world experiments demonstrate that Sentinel-VLA boosts the task success rate by over 30% compared to the SOTA model, PI0. We will open-source all code, weights, and the data generation pipeline.
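The abstract does not say how the OC-Adapter enforces its orthogonality constraint. As a generic illustration of the underlying idea only, the sketch below removes from a parameter update every component that lies in a "protected" subspace, so the remaining update is orthogonal to it. The function names and the choice of Gram-Schmidt projection are assumptions made for this example, not the paper's method.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: Sequence[Sequence[float]], eps: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update: Sequence[float], protected: Sequence[Sequence[float]]) -> Vector:
    """Return the part of `update` orthogonal to the span of `protected`."""
    result = list(update)
    for b in orthonormal_basis(protected):
        c = dot(result, b)
        result = [ri - c * bi for ri, bi in zip(result, b)]
    return result


# Hypothetical directions that earlier tasks depend on.
protected_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
raw_update = [2.0, 3.0, 4.0]

safe_update = project_out(raw_update, protected_dirs)
print(safe_update)  # [0.0, 0.0, 4.0]
```

Because the projected update has zero inner product with every protected direction, applying it leaves the model's behavior along those directions unchanged to first order, which is the intuition behind using orthogonal updates to limit catastrophic forgetting.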

Problem

Research questions and friction points this paper is trying to address.

vision-language-action

reasoning capability

status monitoring

error recovery

embodied manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

metacognitive VLA

active status monitoring

on-demand reasoning