Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models lack effective spatial reasoning in complex scenes, which hinders precise localization of interaction regions. This work proposes a unified framework that integrates task-conditioned affordance as an explicit visual planning interface within the VLA architecture: learnable <AFF> tokens query task-relevant interaction regions, affordance masks are decoded from multimodal features, and the masks are compressed into compact embeddings that directly condition action generation. Because affordance is both generated and used inside the model, the affordance pathway can be optimized jointly with action prediction, giving a tightly coupled perception-action pathway that is local, visually grounded, and internally generated. The method attains state-of-the-art performance on the simulation benchmarks LIBERO, LIBERO-Plus, and SimplerEnv, and reports strong results on real-world tasks.

📝 Abstract

Vision-language-action (VLA) models have shown strong potential for generalist robot manipulation, yet they remain limited by insufficient spatial reasoning, particularly in determining where to interact in complex visual scenes. While recent efforts introduce various forms of visual planning to address this issue, existing approaches rely on global geometric cues, symbolic intermediate representations, or externally generated visual signals, which are often weakly coupled with downstream action prediction. In this work, we revisit visual planning in VLA systems and argue that effective planning should be local, visually grounded, internally generated, and directly aligned with action. Based on this insight, we propose Afford-VLA, a unified framework that internalizes task-conditioned affordance as an explicit visual planning interface within VLA models. Concretely, we introduce learnable <AFF> tokens to query task-relevant interaction regions, decode affordance masks from multimodal features, and convert them into compact embeddings that directly condition action generation. This design enables affordance to be both generated and utilized within the VLA, forming a tightly coupled perception-action pathway. To further support this integration, we adopt a training strategy that allows the affordance pathway to be jointly optimized with action prediction, improving its effectiveness for downstream control. We evaluate our method on multiple simulation benchmarks, including LIBERO, LIBERO-Plus, and SimplerEnv, achieving consistent state-of-the-art performance, along with strong real-world results. These findings demonstrate that internalizing affordance as action-aligned visual planning provides a powerful paradigm for improving VLA systems.
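The pathway described in the abstract (an <AFF> query, a decoded affordance mask, a compact embedding, and a conditioned action head) can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product mask decoder, the mask-weighted pooling, the linear action head, and every name and number below are assumptions chosen only to make the data flow concrete.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_affordance_mask(aff_token, patch_features):
    """Score each image patch against the <AFF> query (assumed: dot product + sigmoid)."""
    return [
        sigmoid(sum(q * f for q, f in zip(aff_token, feat)))
        for feat in patch_features
    ]


def pool_affordance_embedding(mask, patch_features):
    """Compress the mask into one vector (assumed: mask-weighted mean of patch features)."""
    total = sum(mask)  # strictly positive, since every sigmoid output is > 0
    dim = len(patch_features[0])
    return [
        sum(m * feat[d] for m, feat in zip(mask, patch_features)) / total
        for d in range(dim)
    ]


def predict_action(aff_embedding, action_weights):
    """Hypothetical linear action head conditioned on the affordance embedding."""
    return [
        sum(w * e for w, e in zip(row, aff_embedding))
        for row in action_weights
    ]


# Toy inputs: one 2-D <AFF> query and three 2-D patch features (illustrative values).
aff_token = [1.0, 0.0]
patch_features = [[4.0, 0.0], [-4.0, 1.0], [0.0, 2.0]]
action_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

mask = decode_affordance_mask(aff_token, patch_features)
embedding = pool_affordance_embedding(mask, patch_features)
action = predict_action(embedding, action_weights)

print("mask:", [round(m, 3) for m in mask])
print("embedding:", [round(e, 3) for e in embedding])
print("action:", [round(a, 3) for a in action])
```

In this sketch the first patch aligns with the query and dominates the pooled embedding, so the action head is driven mainly by the region the mask highlights. Because every step is a differentiable function of the query, a real model built this way could in principle backpropagate an action loss through the mask, which is the kind of joint optimization the abstract describes.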

Problem

Research questions and friction points this paper is trying to address.

visual planning

spatial reasoning

affordance

vision-language-action

robot manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

affordance

visual planning

vision-language-action