Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited error-correction capability of existing vision-language-action (VLA) models under out-of-distribution shifts or visually ambiguous conditions, stemming from their inability to effectively integrate human spatial guidance. To overcome this, we propose GTA-VLA, a novel framework that, for the first time, incorporates user-provided spatial priors—such as actionable points, bounding boxes, or trajectories—into the intermediate reasoning stages of VLA models. This integration establishes a unified spatial-visual chain-of-thought, enabling human-in-the-loop embodied reasoning and interactive error correction. Leveraging a lightweight reactive action head, our approach supports efficient and intervenable decision-making, achieving an 81.2% success rate on the SimplerEnv WidowX benchmark. Under domain-shift and visual-ambiguity challenges, GTA-VLA significantly outperforms current methods with just a single visual interaction, markedly enhancing robustness and alignment with user intent.

📝 Abstract

In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Project page: https://signalispupupu.github.io/GTA-VLA_ProjPage/
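
The abstract does not say how the optional spatial priors reach the reasoning stage. As a rough illustration only, the sketch below shows one plausible way to serialize an affordance point, a box, or a trace into the text prefix that a reasoning model conditions on. Every name here (SpatialGuidance, build_reasoning_prompt, the normalized-coordinate convention, the prompt layout) is a hypothetical assumption and is not taken from the GTA-VLA implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumption: coordinates are normalized image coordinates in [0, 1].
Point = Tuple[float, float]


@dataclass
class SpatialGuidance:
    """Optional user-provided spatial priors (hypothetical container)."""
    point: Optional[Point] = None                            # affordance point
    box: Optional[Tuple[float, float, float, float]] = None  # x_min, y_min, x_max, y_max
    trace: List[Point] = field(default_factory=list)         # ordered waypoints

    def is_empty(self) -> bool:
        return self.point is None and self.box is None and not self.trace


def _fmt(values) -> str:
    """Format a tuple of floats with two decimals, comma-separated."""
    return ", ".join(f"{v:.2f}" for v in values)


def build_reasoning_prompt(instruction: str,
                           guidance: Optional[SpatialGuidance] = None) -> str:
    """Build a text prefix for the reasoning stage.

    Guidance is optional: with no guidance the prompt contains only the
    instruction, so the policy can still run autonomously.
    """
    lines = [f"Instruction: {instruction}"]
    if guidance is not None and not guidance.is_empty():
        if guidance.point is not None:
            lines.append(f"Guidance point: ({_fmt(guidance.point)})")
        if guidance.box is not None:
            lines.append(f"Guidance box: ({_fmt(guidance.box)})")
        if guidance.trace:
            waypoints = " -> ".join(f"({_fmt(p)})" for p in guidance.trace)
            lines.append(f"Guidance trace: {waypoints}")
    lines.append("Reasoning:")
    return "\n".join(lines)


if __name__ == "__main__":
    hint = SpatialGuidance(point=(0.5, 0.25))
    print(build_reasoning_prompt("pick up the spoon", hint))
```

Keeping the guidance optional mirrors the abstract's description: without any user input the prompt reduces to the plain instruction, while a single point, box, or trace adds one line that the subsequent reasoning can condition on.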

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

out-of-distribution

spatial guidance

embodied reasoning

failure recovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive reasoning

spatial guidance

vision-language-action models