🤖 AI Summary
Embodied agents operating in physical environments face a fundamental “action poverty” issue: existing simulators provide only limited, domain-specific APIs, hindering cross-task generalization due to insufficiently expressive primitive action definitions.
Method: We propose the first cross-domain API universe bootstrapped from human-authored tutorials (wikiHow), systematically inducing minimal, executable primitives by mapping tutorial instructions to grounded policies. Our iterative API induction framework integrates few-shot prompting with GPT-4, Python program generation, instruction-action alignment, and contextual policy grounding.
Contribution/Results: Applied to just 0.5% of wikiHow, our method induces over 300 high-frequency, semantically grounded APIs. Human evaluation reveals that leading embodied simulators cover only nine of the top-50 induced APIs—quantifying for the first time the severity of action-space sparsity. This work establishes a foundational benchmark and design paradigm for principled primitive action space construction in embodied AI.
📝 Abstract
AI systems make decisions in physical environments through primitive actions or affordances that are accessed via API calls. While deploying AI agents in the real world involves numerous high-level actions, existing embodied simulators offer only a limited set of domain-salient APIs. This naturally raises the questions: how many primitive actions (APIs) does a versatile embodied agent need, and what should they look like?
We explore this via a thought experiment: assuming that wikiHow tutorials cover a wide variety of human-written tasks, what is the space of APIs needed to cover these instructions? We propose a framework to iteratively induce new APIs by grounding wikiHow instructions to situated agent policies. Inspired by recent successes in large language models (LLMs) for embodied planning, we use few-shot prompting to steer GPT-4 to generate Pythonic programs as agent policies and bootstrap a universe of APIs by 1) reusing a seed set of APIs; and then 2) fabricating new API calls when necessary. The focus of this thought experiment is on defining these APIs rather than their executability.
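The bootstrapping loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `generate_policy` is a hypothetical stand-in for the few-shot GPT-4 prompt that maps an instruction (plus the currently known APIs) to a Pythonic policy, and the toy policy below is invented for demonstration.

```python
import ast

def extract_api_calls(program: str) -> set:
    """Collect every named function call in a Pythonic policy program."""
    calls = set()
    for node in ast.walk(ast.parse(program)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.add(node.func.id)
    return calls

def induce_apis(instructions, generate_policy, seed_apis):
    """Iteratively grow an API universe from grounded policies.

    For each instruction, the LLM (here, `generate_policy`) is prompted
    with the APIs induced so far; any call it emits that is not already
    known is treated as a newly fabricated API and added to the universe.
    """
    api_universe = set(seed_apis)
    for instruction in instructions:
        program = generate_policy(instruction, sorted(api_universe))
        api_universe |= extract_api_calls(program)
    return api_universe

# Toy stand-in for the LLM call (hypothetical output for one instruction).
def toy_policy(instruction, known_apis):
    return "walk_to('sink')\nturn_on('faucet')\nrinse('cup')"

universe = induce_apis(["Wash a cup."], toy_policy, {"walk_to", "grasp"})
```

Starting from the seed `{walk_to, grasp}`, the single toy instruction adds `turn_on` and `rinse` as fabricated APIs; over many tutorials this reuse-or-fabricate loop is what yields the 300+ API universe reported below.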
We apply the proposed pipeline to instructions from wikiHow tutorials. On a small fraction (0.5%) of tutorials, we induce an action space of 300+ APIs necessary for capturing the rich variety of tasks in the physical world. A detailed automatic and human analysis of the induction output reveals that the proposed pipeline enables effective reuse and creation of APIs. Moreover, a manual review reveals that existing simulators support only a small subset of the induced APIs (9 of the 50 most frequent APIs), motivating the development of action-rich embodied environments.