BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

📅 2025-04-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing web browsing benchmarks (e.g., BrowseComp) are heavily English-centric and fail to capture Chinese-specific challenges, including linguistic complexity, internet censorship mechanisms, and infrastructure heterogeneity. To address this gap, the authors introduce BrowseComp-ZH, the first high-difficulty, multi-hop web browsing benchmark tailored to the Chinese web. It comprises 289 reverse-engineered questions with short, objective answers spanning 11 domains, with a two-stage quality control process ensuring both difficulty and answer uniqueness. The authors further develop an automated evaluation framework grounded in real Chinese web pages. Experiments across more than 20 state-of-the-art LLMs and agentic search systems reveal severe limitations: most models achieve below 10% accuracy, and even the best-performing system, OpenAI's DeepResearch, reaches only 42.9%. These results expose fundamental weaknesses in current LLMs' capabilities for Chinese multi-hop retrieval, cross-page reasoning, and heterogeneous information fusion. All data, code, and construction guidelines are publicly released.

📝 Abstract
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' real-time web browsing ability on the Chinese web
Addressing the lack of benchmarks for non-English web ecosystems
Assessing multi-hop retrieval and reasoning across diverse domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

BrowseComp-ZH, the first high-difficulty browsing benchmark for the Chinese web
289 multi-hop questions reverse-engineered from short, verifiable answers across 11 domains
Two-stage quality control protocol ensuring question difficulty and answer uniqueness
👥 Authors
Peilin Zhou (HKUST; Peking University) — sequential recommendation, natural language processing
Bruce Leon (Peking University)
Xiang Ying (co-founder @ Mindverse) — AI, LLM, GNN, graph data mining
Can Zhang (Peking University)
Yifan Shao
Qichen Ye (Peking University) — natural language processing, recommendation systems
Dading Chong (Peking University) — multimodal representation, large language models, multimodal recommendation
Zhiling Jin (Zhejiang University)
Chenxuan Xie (Zhejiang University of Technology) — graph data mining
Meng Cao (Postdoc, Carnegie Mellon University) — psychology
Yuxin Gu (NIO)
Sixin Hong (Peking University)
Jing Ren (Peking University)
Jian Chen (Hong Kong University of Science and Technology (Guangzhou); HSBC)
Chao Liu (Peking University)
Yining Hua (Harvard T.H. Chan School of Public Health)