🤖 AI Summary
To address the low sample efficiency of web GUI navigation, which stems from sparse rewards and a large, unstructured action space, this paper proposes the "Code as Generative Affordances" (CoGA) paradigm. CoGA leverages pre-trained vision-language models (VLMs) to generate programs that act as implicit intent-completion functions, producing the subset of affordable actions for each observation and thereby reducing the action space without expert demonstrations. The generated programs pass through a fully automated generation-and-verification pipeline and are then used in-the-loop of an online reinforcement learning (RL) agent to filter irrelevant actions during training. On MiniWob++, CoGA is orders of magnitude more sample efficient than its underlying RL agent. Moreover, its programs generalize within families of related tasks, and it matches or exceeds behavior cloning when only a few expert demonstrations are available. The core contribution lies in formalizing affordances as verifiable, intent-conditioned generated code, enabling adaptive, intent-driven contraction of the action space.
📝 Abstract
Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through $\textit{intent-based affordances}$ -- i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose $\textbf{Code as Generative Affordances}$ ($\textbf{\texttt{CoGA}}$), a method that leverages pre-trained vision-language models (VLMs) to generate code that determines affordable actions through implicit intent-completion functions, using a fully-automated program generation and verification pipeline. These programs are then used in-the-loop of a reinforcement learning agent to return a set of affordances given a pixel observation. By greatly reducing the number of actions that an agent must consider, we demonstrate on a wide range of tasks in the MiniWob++ benchmark that: $\textbf{1)}$ $\texttt{CoGA}$ is orders of magnitude more sample efficient than its RL agent, $\textbf{2)}$ $\texttt{CoGA}$'s programs can generalize within a family of tasks, and $\textbf{3)}$ $\texttt{CoGA}$ performs better than or on par with behavior cloning when a small number of expert demonstrations is available.
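The in-the-loop use of affordance programs can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: `get_affordances` stands in for a VLM-generated, verified program (here hand-written against a toy observation format with a `"clickable"` set), and the RL agent is reduced to epsilon-greedy selection over Q-values masked to the affordable subset.

```python
import numpy as np

# Hypothetical affordance program: in CoGA such functions are generated by a
# VLM from pixel observations and verified automatically; this hand-written
# stand-in only illustrates the interface (observation -> action subset).
def get_affordances(observation, full_action_space):
    """Return the subset of actions deemed relevant for this observation."""
    clickable = observation.get("clickable", set())
    return [a for a in full_action_space if a in clickable]

def select_action(q_values, full_action_space, observation, rng):
    """Epsilon-greedy selection restricted to the affordable actions."""
    affordable = get_affordances(observation, full_action_space)
    if not affordable:  # fall back to the unconstrained action space
        affordable = full_action_space
    epsilon = 0.1
    if rng.random() < epsilon:
        # Explore only among affordable actions.
        return affordable[rng.integers(len(affordable))]
    # Exploit: argmax over Q-values, masked to the affordable subset.
    masked = {a: q_values[a] for a in affordable}
    return max(masked, key=masked.get)

rng = np.random.default_rng(0)
actions = list(range(6))                  # toy unified action space
obs = {"clickable": {2, 4}}               # only two actions are affordable
q = {a: float(a) for a in actions}
act = select_action(q, actions, obs, rng)
```

The point of the sketch is the masking step: the agent never evaluates or explores the four non-affordable actions, which is the mechanism behind the sample-efficiency gains the abstract describes.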