Are LLMs Good Literature Review Writers? Evaluating the Literature Review Writing Ability of Large Language Models

📅 2024-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are increasingly deployed for academic literature review writing, yet their reliability in generating accurate citations, coherent summaries, and factually consistent content remains poorly quantified. Method: We propose the first multidimensional automated evaluation framework specifically designed for literature reviews, integrating external retrieval, semantic similarity modeling, factual verification modules, and a cross-disciplinary benchmark dataset to quantify hallucination rates, disciplinary variability, and contextual factual consistency. Contribution/Results: Empirical evaluation across major LLMs reveals pervasive citation hallucination, strong disciplinary dependence in performance, and—critically—a substantial gap between even the best-performing model’s factual consistency and human-level accuracy. The framework exposes fundamental limitations in current LLMs’ scholarly citation reliability and establishes a reproducible, rigorous evaluation paradigm for trustworthy academic generation.

📝 Abstract
The literature review is a crucial form of academic writing that involves complex processes of literature collection, organization, and summarization. The emergence of large language models (LLMs) has introduced promising tools to automate these processes. However, their actual capabilities in writing comprehensive literature reviews remain underexplored: for example, it is unclear whether they can generate accurate and reliable references. To address this gap, we propose a framework to assess the literature review writing ability of LLMs automatically. We evaluate the performance of LLMs across three tasks: generating references, writing abstracts, and writing literature reviews. We employ external tools for a multidimensional evaluation, which includes assessing hallucination rates in references, semantic coverage, and factual consistency with human-written context. Analyzing the experimental results, we find that, despite recent advances, even the most sophisticated models still cannot avoid generating hallucinated references. Additionally, different models exhibit varying performance in literature review writing across different disciplines.
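The paper's evaluation pipeline is not reproduced on this page, but the hallucination-rate metric it describes can be illustrated with a minimal sketch: check each generated reference title against an external bibliography by fuzzy string match and count the misses. The `threshold` value, the toy corpus, and the function names below are assumptions for illustration, not the authors' actual settings or code.

```python
from difflib import SequenceMatcher


def is_hallucinated(title: str, corpus: list[str], threshold: float = 0.9) -> bool:
    """A generated reference counts as hallucinated if no known title matches closely."""
    return not any(
        SequenceMatcher(None, title.lower(), known.lower()).ratio() >= threshold
        for known in corpus
    )


def hallucination_rate(generated: list[str], corpus: list[str]) -> float:
    """Fraction of generated reference titles with no close match in the corpus."""
    if not generated:
        return 0.0
    return sum(is_hallucinated(t, corpus) for t in generated) / len(generated)


# Toy corpus standing in for an external bibliographic database lookup.
corpus = [
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers",
]
generated = [
    "Attention Is All You Need",           # real title, matches the corpus
    "Neural Citation Synthesis at Scale",  # fabricated title, no match
]
print(hallucination_rate(generated, corpus))  # 0.5
```

In practice the corpus lookup would be replaced by a query to a real bibliographic service, and the paper additionally scores semantic coverage and factual consistency, which a title match alone does not capture.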
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Literature Review
Accuracy Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Literature Review Generation
Cross-Disciplinary Performance Evaluation