Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol

📅 2025-04-14
🤖 AI Summary
This study tackles the high-value but challenging task of automatically generating literature review tables for scientific papers with large language models (LLMs), confronting real-world obstacles such as under-specified user prompts, noisy candidate papers, and existing evaluations' overreliance on shallow text similarity. To address these issues, the authors propose ARXIV2TABLE, a realistic and challenging benchmark for this task, and introduce a utility-oriented evaluation paradigm that measures a table's usefulness for downstream comparative analysis. Methodologically, their end-to-end pipeline combines LLM inference, human verification, and task-driven post-processing. Experiments reveal substantial performance gaps across both open-weight and proprietary state-of-the-art LLMs, confirming the task's difficulty. All data and code are publicly released to advance practical, utility-driven research in scientific summarization.

📝 Abstract
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at https://github.com/JHU-CLSP/arXiv2Table.
Problem

Research questions and friction points this paper is trying to address.

Generating literature review tables that effectively summarize collections of scientific papers
Handling under-specified user prompts in table generation
Evaluating a table's utility for information-seeking tasks rather than relying on shallow text similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines LLM-based methods with human annotations to address real-world complexities
Introduces the ARXIV2TABLE benchmark for realistic evaluation
Proposes an approach that handles under-specified prompts and irrelevant content in candidate papers