🤖 AI Summary
This work addresses the challenge of efficient exploration for browser agents, whose action spaces are exponentially large, under fully unsupervised conditions. Methodologically, it introduces the first end-to-end training framework that requires no human annotations, featuring an automatic trajectory-pruning mechanism grounded in the hierarchical structure of natural language instructions. The approach integrates exploration-policy sampling, inverse-action labeling, instruction-decomposition guidance, and fine-tuning of the LLaMA-3.1-8B model. Its core innovation lies in explicitly leveraging instruction semantics to inform trajectory filtering and subtask decomposition, substantially improving exploration efficiency and the fraction of trajectories that can be meaningfully labeled. Evaluated on the WebArena and WebVoyager benchmarks, the method achieves task success rates of 16% and 35%, respectively—outperforming zero-shot LLaMA by 15 and 31 percentage points—and establishes a new state of the art for unsupervised browser agents.
📝 Abstract
We introduce NNetNav, a method for unsupervised interaction with websites that generates synthetic demonstrations for training browser agents. Given any website, NNetNav produces these demonstrations by retroactively labeling action sequences from an exploration policy. Most work on training browser agents has relied on expensive human supervision, and the limited prior work on such interaction-based techniques has failed to search effectively through the exponentially large space of exploration. In contrast, NNetNav exploits the hierarchical structure of language instructions to make this search more tractable: complex instructions are typically decomposable into simpler sub-tasks, allowing NNetNav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task. `LLama-3.1-8b` finetuned on 10k NNetNav self-generated demonstrations obtains over a 16% success rate on WebArena and 35% on WebVoyager, an improvement of 15 and 31 points respectively over zero-shot `LLama-3.1-8b`, outperforming zero-shot GPT-4 and reaching the state of the art among unsupervised methods on both benchmarks.
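The pruning idea described above can be sketched as a simple exploration loop. The sketch below is illustrative only: the function names, the check interval, and the environment interface are assumptions, not the paper's released implementation. An exploration policy proposes actions, and every few steps a labeler (in NNetNav, an LLM performing inverse labeling) tries to describe the trajectory so far as a meaningful sub-task; if it cannot, the episode is pruned rather than explored further.

```python
# Hypothetical sketch of NNetNav-style trajectory pruning.
# `policy`, `labeler`, and `env` are stand-ins: the policy samples
# exploratory actions, the labeler maps a partial trajectory to a
# sub-task description (or None), and the environment is a website.

def explore_with_pruning(policy, labeler, env, max_steps=20, check_every=4):
    """Return (trajectory, sub_task_labels), or None if the episode is pruned."""
    trajectory, labels = [], []
    state = env.reset()
    for step in range(1, max_steps + 1):
        action = policy(state)          # exploration policy samples an action
        state = env.step(action)
        trajectory.append(action)
        if step % check_every == 0:
            label = labeler(trajectory)  # inverse labeling: trajectory -> sub-task
            if label is None:            # no meaningful sub-task: prune early
                return None
            labels.append(label)
    return trajectory, labels
```

Episodes that survive pruning yield (instruction, trajectory) pairs that can be used directly as synthetic demonstrations for fine-tuning; the early exit is what keeps the search over the exponentially large action space tractable.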