🤖 AI Summary
Embodied agents operating in physical environments face a fundamental “action poverty” issue: existing simulators provide only limited, domain-specific APIs, hindering cross-task generalization due to insufficiently expressive primitive action definitions.
Method: We propose the first cross-domain API universe bootstrapped from human-authored tutorials (wikiHow), systematically inducing minimal, executable primitives by mapping tutorial instructions to grounded policies. Our iterative API induction framework integrates few-shot prompting with GPT-4, Python program generation, instruction-action alignment, and contextual policy grounding.
Contribution/Results: Applied to just 0.5% of wikiHow, our method induces over 300 high-frequency, semantically grounded APIs. Human evaluation reveals that leading embodied simulators cover only nine of the top-50 induced APIs—quantifying for the first time the severity of action-space sparsity. This work establishes a foundational benchmark and design paradigm for principled primitive action space construction in embodied AI.
📝 Abstract
AI systems make decisions in physical environments through primitive actions or affordances that are accessed via API calls. While deploying AI agents in the real world involves numerous high-level actions, existing embodied simulators offer only a limited set of domain-salient APIs. This naturally raises the questions: how many primitive actions (APIs) does a versatile embodied agent need, and what should they look like?
We explore this via a thought experiment: assuming that wikiHow tutorials cover a wide variety of human-written tasks, what is the space of APIs needed to cover these instructions? We propose a framework to iteratively induce new APIs by grounding wikiHow instructions to situated agent policies. Inspired by recent successes in large language models (LLMs) for embodied planning, we use few-shot prompting to steer GPT-4 to generate Pythonic programs as agent policies and bootstrap a universe of APIs by 1) reusing a seed set of APIs; and then 2) fabricating new API calls when necessary. The focus of this thought experiment is on defining these APIs rather than their executability.
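The bootstrapping loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `generate_policy` is a hypothetical stand-in for the few-shot GPT-4 prompt that maps an instruction (plus the currently known APIs) to a Pythonic policy, and the toy policy below is invented for demonstration.

```python
import ast

def extract_api_calls(program: str) -> set:
    """Collect every named function call in a Pythonic policy program."""
    calls = set()
    for node in ast.walk(ast.parse(program)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.add(node.func.id)
    return calls

def induce_apis(instructions, generate_policy, seed_apis):
    """Iteratively grow an API universe from grounded policies.

    For each instruction, the LLM (here, `generate_policy`) is prompted
    with the APIs induced so far; any call it emits that is not already
    known is treated as a newly fabricated API and added to the universe.
    """
    api_universe = set(seed_apis)
    for instruction in instructions:
        program = generate_policy(instruction, sorted(api_universe))
        api_universe |= extract_api_calls(program)
    return api_universe

# Toy stand-in for the LLM call (hypothetical output for one instruction).
def toy_policy(instruction, known_apis):
    return "walk_to('sink')\nturn_on('faucet')\nrinse('cup')"

universe = induce_apis(["Wash a cup."], toy_policy, {"walk_to", "grasp"})
```

Starting from the seed `{walk_to, grasp}`, the single toy instruction adds `turn_on` and `rinse` as fabricated APIs; over many tutorials this reuse-or-fabricate loop is what yields the 300+ API universe reported below.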
We apply the proposed pipeline to instructions from wikiHow tutorials. On a small fraction (0.5%) of tutorials, we induce an action space of 300+ APIs necessary for capturing the rich variety of tasks in the physical world. A detailed automatic and human analysis of the induction output reveals that the proposed pipeline enables effective reuse and creation of APIs. Moreover, a manual review reveals that existing simulators support only a small subset of the induced APIs (9 of the 50 most frequent APIs), motivating the development of action-rich embodied environments.