🤖 AI Summary
Current vision-language-action (VLA) models lack the capability to dynamically correct behaviors following task failures in complex robotic control. This paper introduces LITEN, the first framework enabling real-robot trajectory self-correction at inference time. LITEN establishes a closed-loop "reason-act-evaluate" architecture by tightly coupling vision-language models (VLMs) with low-level control policies. It incorporates a trajectory-based reflection mechanism and a structured video feedback extraction method to support context-aware, iterative planning. Crucially, LITEN bridges high-level language reasoning with low-level action execution and leverages historical execution traces to generate highly feasible, executable instructions. Experiments demonstrate substantial improvements in success rates for long-horizon robotic tasks. By endowing VLAs with human-like online reflection and adaptive behavior revision, LITEN advances the frontier of embodied AI reasoning and control.
📄 Abstract
Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong and change our strategy to avoid repeating the mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but they lack the ability to contextually and dynamically readjust their behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution and draws useful conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires structured guiderails during assessment. Our experimental results demonstrate that LITEN effectively learns from past experience, generating plans that use high-affordance instructions to accomplish long-horizon tasks.
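The iterative structure described in the abstract (a reasoning phase that plans and executes, followed by an assessment phase whose conclusions feed back into the reasoning context) can be sketched in a few lines of Python. This is an illustrative mock, not the paper's implementation: all function names (`propose_plan`, `execute_plan`, `assess_execution`, `liten_loop`) are hypothetical stand-ins, and the VLM/VLA calls are replaced with deterministic stubs.

```python
# Minimal sketch of a LITEN-style reason-act-assess loop.
# The real system would call a VLM for planning and a VLA policy for
# execution; here both are stubbed so the control flow is visible.

def propose_plan(task, memory):
    # Reasoning phase: the VLM would condition on the task and on the
    # conclusions accumulated in-context (memory). Stubbed here.
    return f"plan-for-{task}-attempt-{len(memory)}"

def execute_plan(plan):
    # The low-level VLA would execute the plan on the robot and return
    # a trajectory (e.g., raw video). Stub: succeed once the plan has
    # been revised twice.
    return plan.endswith("attempt-2")

def assess_execution(plan, success):
    # Assessment phase: structured reflection on the trajectory,
    # distilled into a conclusion for future reasoning contexts.
    outcome = "succeeded" if success else "failed"
    return f"{plan} {outcome}"

def liten_loop(task, max_iters=5):
    memory = []  # in-context experience shared across attempts
    for _ in range(max_iters):
        plan = propose_plan(task, memory)               # reason
        success = execute_plan(plan)                    # act
        memory.append(assess_execution(plan, success))  # evaluate
        if success:
            break
    return memory

history = liten_loop("stack-blocks")
```

With these stubs, the loop fails twice, records both failures in `memory`, and succeeds on the third attempt; the point is only that each new plan is generated with the full assessment history in context.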