Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action

📅 2025-09-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Imitation learning and vision-language-action (VLA) models generalize poorly to complex instructions and novel scenes. Agentic Scene Policies (ASP) takes a different route: instead of an end-to-end policy, it uses an explicit, queryable scene representation as the interface between the robot and the world, and uses query results to guide downstream motion planning. An agentic framework issues semantic, spatial, and affordance-based queries against this representation, which lets ASP execute open-vocabulary queries zero-shot and reason explicitly about object affordances for more complex skills. Experiments compare ASP with VLAs on tabletop manipulation and show that, with affordance-guided navigation and a scaled-up scene representation, ASP can also handle room-level queries.

📝 Abstract

Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-action models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation and a scaled-up scene representation. (Project page: https://montrealrobotics.ca/agentic-scene-policies.github.io/)
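
To make the idea of a scene representation as a "queryable interface" concrete, the following is a minimal Python sketch. It is not the authors' implementation. Every name in it (SceneObject, SceneRepresentation, query_semantic, query_spatial, query_affordance, plan, the reach parameter, and the example objects) is hypothetical. It also assumes exact string matching on labels, whereas a real open-vocabulary system would presumably match queries against learned features. The sketch only shows how semantic, spatial, and affordance queries could feed a downstream plan that navigates first when the target is out of reach.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """Hypothetical scene entry: a named object with a 3D position."""
    name: str
    position: tuple
    labels: set = field(default_factory=set)        # semantic tags
    affordances: set = field(default_factory=set)   # e.g. "grasp", "open"


class SceneRepresentation:
    """Toy queryable scene; exact label matching stands in for
    open-vocabulary matching (an assumption of this sketch)."""

    def __init__(self, objects):
        self.objects = list(objects)

    def query_semantic(self, label):
        return [o for o in self.objects if label in o.labels]

    def query_spatial(self, point, radius):
        return [o for o in self.objects
                if math.dist(o.position, point) <= radius]

    def query_affordance(self, affordance):
        return [o for o in self.objects if affordance in o.affordances]


def plan(scene, label, affordance, robot_position, reach=1.0):
    """Return a list of (skill, object_name) steps, or [] if nothing matches."""
    candidates = [o for o in scene.query_semantic(label)
                  if affordance in o.affordances]
    if not candidates:
        return []
    # Pick the matching object closest to the robot.
    target = min(candidates,
                 key=lambda o: math.dist(o.position, robot_position))
    steps = []
    # Navigate first when the target is farther away than the assumed reach.
    if math.dist(target.position, robot_position) > reach:
        steps.append(("navigate_to", target.name))
    steps.append((affordance, target.name))
    return steps


scene = SceneRepresentation([
    SceneObject("mug_1", (0.5, 0.0, 0.8), {"mug", "cup"}, {"grasp"}),
    SceneObject("drawer_1", (4.0, 2.0, 0.5), {"drawer"}, {"open", "close"}),
])

print(plan(scene, "drawer", "open", (0.0, 0.0, 0.0)))
# [('navigate_to', 'drawer_1'), ('open', 'drawer_1')]
```

In this toy version the drawer is about 4.5 m from the robot, so the plan contains a navigation step before the manipulation step. A query for an affordance the object does not offer (for example, "open" on the mug) returns an empty plan instead of guessing.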

Problem

Research questions and friction points this paper is trying to address.

Executing complex, open-ended natural language instructions on a robot

Addressing limitations of end-to-end models in new scenes

Enabling zero-shot reasoning about object affordances for manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework leveraging semantic, spatial, and affordance-based scene queries

Zero-shot execution of open-vocabulary instructions through explicit affordance reasoning

Scalable scene representation combining manipulation and navigation
