ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents

📅 2024-06-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the limited capability of large language models (LLMs) in academic literature survey tasks—specifically, information discovery, filtering, and structured organization. To this end, the authors introduce ResearchArena, the first offline benchmark systematically covering three core stages: literature retrieval, cross-document relevance assessment, and knowledge structuring (e.g., mind-map generation). Built upon the Semantic Scholar open corpus (12M full-text papers), the evaluation paradigm integrates retrieval augmentation, multi-step reasoning, and structured output generation. Key contributions include: (1) the first end-to-end modeling of academic survey workflows; (2) a hierarchical evaluation framework in which mind-map generation serves as an optional higher-order capability metric; and (3) empirical findings showing that current LLMs substantially underperform keyword-based baselines, exposing fundamental deficiencies in cross-document analysis, domain discrimination, and knowledge integration.
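The three-stage workflow described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the corpus, relevance scores, and mind-map structure are hypothetical stand-ins for the benchmark's actual interfaces, and the discovery step uses a simple keyword-overlap baseline of the kind the paper reports LLMs underperform.

```python
def discover(corpus, query):
    """Stage 1: information discovery via a keyword-overlap baseline
    (illustrative; not the benchmark's retrieval implementation)."""
    terms = set(query.lower().split())
    scored = [(sum(t in doc.lower() for t in terms), doc_id)
              for doc_id, doc in corpus.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

def select(candidates, relevance, k=2):
    """Stage 2: information selection -- rank candidates by a relevance
    judgment (precomputed here; produced by the model in the benchmark)."""
    return sorted(candidates, key=lambda d: relevance.get(d, 0), reverse=True)[:k]

def organize(selected, topic):
    """Stage 3: information organization -- group selected papers under a
    root topic, a flat stand-in for hierarchical mind-map construction."""
    return {topic: selected}

# Toy corpus keyed by paper id (hypothetical titles).
corpus = {
    "p1": "Large language models for literature retrieval",
    "p2": "Protein folding with deep networks",
    "p3": "Retrieval-augmented language models survey",
}
found = discover(corpus, "language models retrieval")
chosen = select(found, relevance={"p1": 0.9, "p3": 0.8, "p2": 0.1})
mindmap = organize(chosen, "LLM survey methods")
print(mindmap)  # {'LLM survey methods': ['p1', 'p3']}
```

In the benchmark itself, each stage is evaluated separately against the survey-paper ground truth, with mind-map construction treated as a bonus task.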

📝 Abstract
Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to evaluate LLMs' capabilities in conducting academic surveys – a foundational step in academic research. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers' relevance and impact; and (3) information organization, structuring knowledge into hierarchical frameworks such as mind-maps. Notably, mind-map construction is treated as a bonus task, reflecting its supplementary role in survey-writing. To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers. To ensure ethical compliance, we do not redistribute copyrighted materials; instead, we provide code to construct the environment from the Semantic Scholar Open Research Corpus (S2ORC). Preliminary evaluations reveal that LLM-based approaches underperform compared to simpler keyword-based retrieval methods, underscoring significant opportunities for advancing LLMs in autonomous research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in academic survey tasks
Assessing information discovery and organization capabilities
Benchmarking LLMs against keyword-based retrieval methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates LLMs' research survey capabilities
Stages include discovery, selection, and organization
Utilizes Semantic Scholar Open Research Corpus