🤖 AI Summary
Existing imitation learning and vision-language-action (VLA) models generalize poorly to complex instructions and novel scenes. As an alternative to such end-to-end policies, we propose Agentic Scene Policies (ASP), an agentic framework built on an explicit, queryable scene representation that supports semantic, spatial, and affordance-based queries and serves as the interface between the robot and the world. Query results guide downstream motion planning, and explicit reasoning about object affordances lets ASP execute open-vocabulary queries zero-shot, including those that require more complex skills. Experiments compare ASP with VLAs on tabletop manipulation problems and show that ASP can handle room-level queries through affordance-guided navigation and a scaled-up scene representation.
📝 Abstract
Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-action models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation and a scaled-up scene representation. (Project page: https://montrealrobotics.ca/agentic-scene-policies.github.io/)
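The idea of a scene representation used as a queryable interface can be illustrated with a toy sketch. This is not ASP's implementation: the names `SceneObject`, `Scene`, `query_semantic`, `query_affordance`, `query_nearest`, and `plan` are hypothetical, and plain substring matching stands in for whatever open-vocabulary matching a real scene representation would use. It only shows how semantic, affordance, and spatial query results might be combined into a goal for a downstream planner.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """Hypothetical scene entry: a label, a 3D position, and affordances."""
    name: str
    position: tuple
    affordances: set = field(default_factory=set)


class Scene:
    """Toy queryable scene; all method names are assumptions for illustration."""

    def __init__(self, objects):
        self.objects = list(objects)

    def query_semantic(self, label):
        # Stand-in for open-vocabulary matching: case-insensitive substring test.
        return [o for o in self.objects if label.lower() in o.name.lower()]

    def query_affordance(self, affordance):
        # Objects that support the requested interaction.
        return [o for o in self.objects if affordance in o.affordances]

    def query_nearest(self, candidates, point):
        # Spatial query: closest candidate to a point, or None if there is none.
        return min(candidates,
                   key=lambda o: math.dist(o.position, point),
                   default=None)


def plan(scene, label, affordance, robot_position):
    """Combine the three query types into a goal for a downstream planner."""
    candidates = [o for o in scene.query_semantic(label)
                  if affordance in o.affordances]
    target = scene.query_nearest(candidates, robot_position)
    if target is None:
        return None  # nothing in the scene matches the instruction
    return {"skill": affordance, "target": target.name, "goal": target.position}


scene = Scene([
    SceneObject("kitchen drawer", (2.0, 0.0, 0.5), {"open"}),
    SceneObject("bedroom drawer", (5.0, 1.0, 0.5), {"open"}),
    SceneObject("mug", (2.0, 0.2, 0.9), {"grasp"}),
])
result = plan(scene, "drawer", "open", (0.0, 0.0, 0.0))
```

Here `result` selects the drawer closest to the robot among objects that both match the label and afford the requested skill. A query such as `plan(scene, "mug", "open", ...)` returns `None`, because the mug matches semantically but does not afford opening.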