AcademicBrowse: Benchmarking Academic Browse Ability of LLMs

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks (e.g., BrowseComp) emphasize general-purpose web search and lack systematic assessment of core academic retrieval capabilities: citation tracing, domain-specific database navigation, long-tail scholarly knowledge acquisition, and academic rigor. Method: We introduce AcademicBrowse, the first LLM benchmark dedicated to academic retrieval, defining "academic browsing ability" along four dimensions: academic practicality, high difficulty, concise evaluability, and broad disciplinary coverage. The benchmark is a question-answering dataset grounded in real-world research tasks and spanning 15+ academic disciplines; each question is designed to require at least three deep-search steps to resolve. Questions are constructed so that answers are as unique as possible, with source annotations and brief solution explanations to ease auditing and verification. Contribution/Results: The benchmark makes academic information retrieval performance measurable and comparable across models, and the dataset is publicly released.
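
The "at least three deep searches" requirement implies an iterative browse loop rather than a single retrieval call. Below is a minimal sketch of such a loop; `search` and `refine` are hypothetical placeholders for an agent's retrieval backend and query-reformulation step, not code released with the paper.

```python
# Schematic of a multi-hop "deep search" loop of the kind AcademicBrowse
# questions are built to require (>= 3 retrieval hops). `search` and `refine`
# are hypothetical placeholders, not part of the paper's released artifacts.
from typing import Callable, List

def deep_search(question: str,
                search: Callable[[str], str],
                refine: Callable[[str, List[str]], str],
                hops: int = 3) -> List[str]:
    """Run `hops` retrieval rounds, reformulating the query from evidence so far."""
    evidence: List[str] = []
    query = question
    for _ in range(hops):
        evidence.append(search(query))        # one retrieval hop
        query = refine(question, evidence)    # next query from accumulated evidence
    return evidence
```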

📝 Abstract
The search capabilities of Large Language Models (LLMs) have garnered significant attention. Existing benchmarks, such as OpenAI's BrowseComp, primarily focus on general search scenarios and fail to adequately address the specific demands of academic search. These demands include deeper literature tracing and organization, professional support for academic databases, the ability to navigate long-tail academic knowledge, and academic rigor. Here, we propose AcademicBrowse, the first dataset specifically designed to evaluate the complex information retrieval capabilities of LLMs in academic research. AcademicBrowse has the following key characteristics: Academic Practicality, where question content closely mirrors real academic learning and research environments, avoiding deliberately misleading models; High Difficulty, where answers are challenging for single models (e.g., Grok DeepSearch or Gemini Deep Research) to provide directly, often requiring at least three deep searches to derive; Concise Evaluation, where limiting conditions make answers as unique as possible, accompanied by clear sources and brief solution explanations that greatly facilitate subsequent auditing and verification, addressing the current lack of such analyzable search datasets; and Broad Coverage, as the dataset spans at least 15 academic disciplines. Through AcademicBrowse, we expect to more precisely measure and promote improvements in LLM performance on complex academic information retrieval tasks. The data is available at: https://huggingface.co/datasets/PKU-DS-LAB/AcademicBrowse
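
Because answers are designed to be unique, sourced, and concisely verifiable, evaluation can in principle reduce to normalized exact match against the released dataset. Here is a minimal sketch assuming the Hugging Face `datasets` library; the split name and field names ("question", "answer") are guesses and should be checked against the dataset card.

```python
# A minimal evaluation sketch, assuming the Hugging Face `datasets` library.
# The split name and field names ("question", "answer") are guesses; verify
# them against the dataset card before use.
from datasets import load_dataset

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so surface variants still match."""
    return " ".join(text.lower().split())

def exact_match_accuracy(predict, split: str = "train") -> float:
    """Score `predict(question) -> answer` by normalized exact match."""
    ds = load_dataset("PKU-DS-LAB/AcademicBrowse", split=split)
    hits = sum(normalize(predict(ex["question"])) == normalize(ex["answer"])
               for ex in ds)
    return hits / len(ds)

if __name__ == "__main__":
    # Toy predictor that always abstains; substitute a real browsing agent.
    print(exact_match_accuracy(lambda q: "unknown"))
```

Normalizing before comparison keeps trivially different surface forms (case, whitespace) from counting as errors.
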
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' academic search capabilities in complex scenarios
Addressing gaps in existing benchmarks for academic research demands
Providing a specialized dataset for rigorous academic retrieval tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialized dataset for academic search evaluation
High difficulty requiring multiple deep searches
Broad coverage spanning 15+ academic disciplines
Junting Zhou
Peking University
Large Language Model · AI for Science · Bioinformatics
Wang Li
Peking University, Beijing, China
Yiyan Liao
Peking University, Beijing, China
Nengyuan Zhang
Peking University, Beijing, China
Tingjia Miao
Peking University, Beijing, China
Zhihui Qi
Peking University, Beijing, China
Yuhan Wu
Peking University, Ph.D. student in CS, yuhan.wu [at] pku.edu.cn. Chinese name: 吴钰晗.
Data Structures · Networking · Big Data
Tong Yang
Peking University, Beijing, China