ArxivBench: Can LLMs Assist Researchers in Conducting Research?

📅 2025-04-06
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently generate hallucinated arXiv URLs or cite non-existent papers in academic writing, severely undermining scholarly attribution reliability. To address this, we propose arXivBench, the first fine-grained, cross-disciplinary, and reproducible benchmark for academic literature retrieval, covering eight broad disciplines and five core subfields of computer science. Our test set is constructed from structured, site-wide arXiv metadata, and we introduce a high-level natural language prompting framework for response evaluation, jointly measuring relevance and URL correctness. Experiments reveal substantial disciplinary disparities in LLM performance (e.g., accuracy below 30% in some domains), with AI subfields achieving the highest scores; Claude-3.5-Sonnet attains the best overall performance. We fully open-source all data, prompts, and evaluation code, establishing a critical infrastructure for assessing LLM trustworthiness in scholarly applications.

๐Ÿ“ Abstract
Large language models (LLMs) have demonstrated remarkable effectiveness in completing various tasks such as reasoning, translation, and question answering. However, factually incorrect content in LLM-generated responses remains a persistent challenge. In this study, we evaluate both proprietary and open-source LLMs on their ability to respond with relevant research papers and accurate links to articles hosted on the arXiv platform, based on high-level prompts. To facilitate this evaluation, we introduce arXivBench, a benchmark specifically designed to assess LLM performance across eight major subject categories on arXiv and five subfields within computer science, one of the most popular categories among them. Our findings reveal concerningly low accuracy in LLM-generated responses that varies substantially by subject, with some subjects faring far worse than others. Notably, Claude-3.5-Sonnet exhibits a substantial advantage in generating both relevant and accurate responses. Interestingly, most LLMs achieve much higher accuracy in the Artificial Intelligence subfield than in other subfields. This benchmark provides a standardized tool for evaluating the reliability of LLM-generated scientific responses, promoting more dependable use of LLMs in academic and research environments. Our code is open-sourced at https://github.com/arxivBenchLLM/arXivBench and our dataset is available on Hugging Face at https://huggingface.co/datasets/arXivBenchLLM/arXivBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs on generating accurate arXiv research paper references
Assesses academic risks of incorrect links and non-existent papers
Benchmarks LLM performance across arXiv subjects and CS subfields
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for assessing the accuracy of LLM-generated arXiv paper references
Covers eight major arXiv subject categories plus five CS subfields
Standardized tool for evaluating LLM reliability in scientific responses