🤖 AI Summary
Existing vision–language–action (VLA) models lack effective spatial reasoning in complex scenes, which hinders precise localization of interaction regions. This work proposes Afford-VLA, a unified framework that integrates task-conditioned affordances as an explicit visual planning interface within the VLA architecture: learnable <AFF> tokens query task-relevant interaction regions, which are decoded into affordance masks and compressed into compact embeddings that directly condition action generation. Because the affordances are local, visually grounded, and generated inside the model, the affordance pathway can be jointly optimized with action prediction, forming a tightly coupled perception–action pathway. The method attains state-of-the-art performance on the simulation benchmarks LIBERO, LIBERO-Plus, and SimplerEnv, and reports strong results on real-world tasks.
📝 Abstract
Vision-language-action (VLA) models have shown strong potential for generalist robot manipulation, yet they remain limited by insufficient spatial reasoning, particularly in determining where to interact in complex visual scenes. While recent efforts introduce various forms of visual planning to address this issue, existing approaches rely on global geometric cues, symbolic intermediate representations, or externally generated visual signals, all of which are often weakly coupled with downstream action prediction. In this work, we revisit visual planning in VLA systems and argue that effective planning should be local, visually grounded, internally generated, and directly aligned with action. Based on this insight, we propose Afford-VLA, a unified framework that internalizes task-conditioned affordance as an explicit visual planning interface within VLA models. Concretely, we introduce learnable <AFF> tokens to query task-relevant interaction regions, decode affordance masks from multimodal features, and convert them into compact embeddings that directly condition action generation. This design enables affordance to be both generated and utilized within the VLA, forming a tightly coupled perception-action pathway. To further support this integration, we adopt a training strategy that allows the affordance pathway to be jointly optimized with action prediction, improving its effectiveness for downstream control. We evaluate our method on multiple simulation benchmarks, including LIBERO, LIBERO-Plus, and SimplerEnv, achieving consistent state-of-the-art performance, along with strong real-world results. These findings demonstrate that internalizing affordance as action-aligned visual planning provides a powerful paradigm for improving VLA systems.
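The query, decode, and compress steps described in the abstract can be illustrated with a toy sketch. This is not the Afford-VLA implementation: the abstract does not specify the mask decoder, the pooling scheme, or any dimensions, so the dot-product query, sigmoid mask, mask-weighted pooling, and every name below (`affordance_pathway`, `patch_feats`, `aff_token`, `proj`) are assumptions chosen only to make the data flow concrete.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affordance_pathway(patch_feats, aff_token, proj):
    """Toy <AFF> pathway (all design choices here are assumptions).

    patch_feats: N visual patch features, each a list of D floats
    aff_token:   one query vector of D floats, standing in for an <AFF> token
    proj:        K rows of D floats, a hypothetical linear projection
    """
    d = len(aff_token)
    # 1. Query: scaled dot product between the token and each patch feature.
    logits = [
        sum(f * q for f, q in zip(feat, aff_token)) / math.sqrt(d)
        for feat in patch_feats
    ]
    # 2. Decode: a soft per-patch affordance mask with values in (0, 1).
    mask = [sigmoid(z) for z in logits]
    # 3. Compress: mask-weighted average of patch features, then project
    #    to a compact K-dimensional embedding.
    total = sum(mask) + 1e-8
    pooled = [
        sum(m * feat[j] for m, feat in zip(mask, patch_feats)) / total
        for j in range(d)
    ]
    embedding = [sum(w * p for w, p in zip(row, pooled)) for row in proj]
    return mask, embedding


random.seed(0)
N, D, K = 16, 8, 4  # illustrative sizes, not taken from the paper
patch_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
aff_token = [random.gauss(0, 1) for _ in range(D)]
proj = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

mask, embedding = affordance_pathway(patch_feats, aff_token, proj)
print(len(mask), len(embedding))
```

In a real model the embedding would be passed to the action head as a conditioning input, and because every step is differentiable, gradients from the action loss could flow back into the token and the mask decoder, which is one way to read the joint optimization the abstract mentions.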