🤖 AI Summary
Current vision-language-action (VLA) models lack the capability to dynamically correct behaviors following task failures in complex robotic control. This paper introduces LITEN, the first framework enabling real-robot trajectory self-correction at inference time. LITEN establishes a closed-loop "reason-act-evaluate" architecture by tightly coupling vision-language models (VLMs) with low-level control policies. It incorporates a trajectory-based reflection mechanism and a structured video feedback extraction method to support context-aware, iterative planning. Crucially, LITEN bridges high-level language reasoning with low-level action execution and leverages historical execution traces to generate highly feasible, executable instructions. Experiments demonstrate substantial improvements in success rates for long-horizon robotic tasks. By endowing VLAs with human-like online reflection and adaptive behavior revision, LITEN advances the frontier of embodied AI reasoning and control.
📄 Abstract
Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong and change our strategy to avoid repeating the mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but they lack the ability to contextually and dynamically readjust their behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution and draws useful conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires structured guiderails during assessment. Our experimental results demonstrate that LITEN effectively learns from past experience, generating plans that use high-affordance instructions to accomplish long-horizon tasks.
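The iterative structure described in the abstract (a reasoning phase that plans and executes, followed by an assessment phase whose conclusions feed back into the reasoning context) can be sketched in a few lines of Python. This is an illustrative mock, not the paper's implementation: all function names (`propose_plan`, `execute_plan`, `assess_execution`, `liten_loop`) are hypothetical stand-ins, and the VLM/VLA calls are replaced with deterministic stubs.

```python
# Minimal sketch of a LITEN-style reason-act-assess loop.
# The real system would call a VLM for planning and a VLA policy for
# execution; here both are stubbed so the control flow is visible.

def propose_plan(task, memory):
    # Reasoning phase: the VLM would condition on the task and on the
    # conclusions accumulated in-context (memory). Stubbed here.
    return f"plan-for-{task}-attempt-{len(memory)}"

def execute_plan(plan):
    # The low-level VLA would execute the plan on the robot and return
    # a trajectory (e.g., raw video). Stub: succeed once the plan has
    # been revised twice.
    return plan.endswith("attempt-2")

def assess_execution(plan, success):
    # Assessment phase: structured reflection on the trajectory,
    # distilled into a conclusion for future reasoning contexts.
    outcome = "succeeded" if success else "failed"
    return f"{plan} {outcome}"

def liten_loop(task, max_iters=5):
    memory = []  # in-context experience shared across attempts
    for _ in range(max_iters):
        plan = propose_plan(task, memory)               # reason
        success = execute_plan(plan)                    # act
        memory.append(assess_execution(plan, success))  # evaluate
        if success:
            break
    return memory

history = liten_loop("stack-blocks")
```

With these stubs, the loop fails twice, records both failures in `memory`, and succeeds on the third attempt; the point is only that each new plan is generated with the full assessment history in context.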