🤖 AI Summary
This work addresses the challenge that general-purpose vision-language-action (VLA) policies struggle to achieve high-precision, slot-level object placement under compositional language instructions. To overcome this limitation, the authors propose AnySlot, a hierarchical framework that introduces explicit spatial visual goals as an intermediate representation between language and control. The framework parses instructions into explicit scene markers that serve as visual goals, which are then executed by a goal-conditioned VLA policy, enabling semantically accurate and spatially robust hierarchical control. By decoupling high-level slot selection from low-level action execution, AnySlot enables structured spatial reasoning. The authors also introduce SlotBench, the first simulation benchmark specifically designed for slot-level placement tasks. Experimental results demonstrate that AnySlot significantly outperforms end-to-end VLA baselines and existing modular approaches in zero-shot settings, showcasing its superior capability in compositional spatial reasoning.
📝 Abstract
Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language instructions remains a major challenge for modern monolithic VLA policies. Slot-level tasks require both reliable slot grounding and sub-centimeter execution accuracy. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal as an intermediate representation between language grounding and control. AnySlot turns language into an explicit visual goal by generating a scene marker, then executes this goal with a goal-conditioned VLA policy. This hierarchical design effectively decouples high-level slot selection from low-level execution, ensuring both semantic accuracy and spatial robustness. Furthermore, recognizing the lack of existing benchmarks for such precision-demanding tasks, we introduce SlotBench, a comprehensive simulation benchmark featuring nine task categories tailored to evaluate structured spatial reasoning in slot-level placement. Extensive experiments show that AnySlot significantly outperforms flat VLA baselines and previous modular grounding methods in zero-shot slot-level placement.
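The two-stage decoupling described above can be illustrated with a minimal toy sketch. Everything here is an illustrative assumption, not the paper's implementation: the slot names, the keyword-matching grounder (standing in for the language-to-marker stage), and the proportional controller (standing in for the goal-conditioned VLA policy) are all hypothetical.

```python
# Hypothetical slot layout: named slots mapped to (x, y) positions in metres.
# In AnySlot the high-level stage would instead emit a visual scene marker.
SLOTS = {"left bin": (0.2, 0.5), "middle bin": (0.4, 0.5), "right bin": (0.6, 0.5)}

def ground_instruction(instruction: str) -> tuple[float, float]:
    """High-level stage (stand-in): map a language instruction to an
    explicit spatial goal, i.e. the position of the referenced slot."""
    for name, pos in SLOTS.items():
        if name in instruction:
            return pos
    raise ValueError("no known slot mentioned in instruction")

def goal_conditioned_step(pos, goal, gain=0.5):
    """Low-level stage (stand-in): move the end-effector a fraction of the
    way toward the spatial goal on each control step."""
    return tuple(p + gain * (g - p) for p, g in zip(pos, goal))

def execute(instruction: str, start=(0.0, 0.0), steps=20):
    goal = ground_instruction(instruction)   # slot selection (semantic)
    pos = start
    for _ in range(steps):                   # goal-conditioned execution (spatial)
        pos = goal_conditioned_step(pos, goal)
    return pos, goal
```

The point of the sketch is the interface: the only thing passed from the semantic stage to the control stage is an explicit spatial goal, so grounding errors and execution errors can be diagnosed and improved independently.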