BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

📅 2025-04-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing web browsing agents lack rigorous evaluation of their ability to persistently and creatively navigate the web to retrieve hard-to-find, entangled information. Method: We introduce BrowseComp, a benchmark of 1,266 questions that require persistent, multi-step navigation of the internet. Each question has a concise, easily verifiable short answer, enabling simple quantitative assessment of browsing persistence and exploration strategies. Contribution/Results: BrowseComp establishes a reproducible, comparable, and focused evaluation paradigm for complex information retrieval on the web. By open-sourcing the benchmark (GitHub: openai/simple-evals), it advances the rigor and standardization of web agent evaluation, providing a new foundation for measuring intelligent browsing capabilities.

๐Ÿ“ Abstract
We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.
Problem

Research questions and friction points this paper is trying to address.

Measure web browsing agents' ability to find hard-to-find information
Provide a simple benchmark with short, verifiable answers
Assess persistence and creativity in navigating the internet
Innovation

Methods, ideas, or system contributions that make the work stand out.

BrowseComp benchmark for web browsing agents
1,266 hard-to-find information questions
Short verifiable answers for easy evaluation
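Because every BrowseComp question has a short reference answer, grading can be as simple as comparing a model's prediction against the reference. The sketch below is an illustrative, minimal string-match grader, not the official simple-evals implementation (which may use a model-based comparison); the normalization rules and function names are assumptions for demonstration.

```python
import re
import string


def normalize(ans: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (illustrative rules)."""
    ans = ans.lower()
    ans = ans.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", ans).strip()


def is_correct(predicted: str, reference: str) -> bool:
    """Exact match after normalization."""
    return normalize(predicted) == normalize(reference)


def score(results: list[tuple[str, str]]) -> float:
    """Accuracy over (predicted, reference) pairs."""
    if not results:
        return 0.0
    return sum(is_correct(p, r) for p, r in results) / len(results)
```

This mirrors the benchmark's design goal: because answers are short and unambiguous, a pass/fail check per question yields a single accuracy number per agent.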
Authors (all OpenAI): Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, Amelia Glaese