Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces the Scientific Introduction Generation (SciIG) task to evaluate large language models' (LLMs) ability to generate high-quality academic introductions from paper titles, abstracts, and related work. To this end, the authors propose the first multi-dimensional evaluation framework designed specifically for scientific introduction generation, covering lexical overlap, semantic similarity, content coverage, and factual fidelity, assessed with both automated metrics and LLM-as-a-judge evaluations. Experiments benchmark state-of-the-art models, including DeepSeek-v3, Gemma-3-12B, LLaMA-4 Maverick, MistralAI Small 3.1, and GPT-4o, under few-shot prompting. Results show that three-shot prompting significantly improves generation quality and that LLaMA-4 Maverick achieves the best performance in semantic coherence and factual consistency. All code and datasets will be publicly released.
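As a concrete illustration of the automated side of this framework, here is a minimal Python sketch that scores a generated introduction for lexical overlap and semantic similarity. The specific metric choices (ROUGE-L, and SBERT cosine similarity via all-MiniLM-L6-v2) are assumptions for illustration; the summary does not pin down the paper's exact implementations.

```python
# A minimal sketch of the automated portion of a SciIG-style evaluation.
# Metric choices here (ROUGE-L for lexical overlap, SBERT cosine similarity
# for semantic similarity) are illustrative assumptions, not the paper's
# confirmed implementation.
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def score_introduction(generated: str, reference: str) -> dict:
    """Score a generated introduction against the paper's actual introduction."""
    lexical = _rouge.score(reference, generated)["rougeL"].fmeasure
    embeddings = _embedder.encode([generated, reference], convert_to_tensor=True)
    semantic = util.cos_sim(embeddings[0], embeddings[1]).item()
    return {"lexical_overlap": lexical, "semantic_similarity": semantic}
```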

📝 Abstract
As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA-4 Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and future research, we will publicly release all code and datasets.
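The abstract's pairing of automated metrics with LLM-as-a-judge evaluation can be sketched as a simple judge call. The judge model, rubric, and prompt wording below are all assumptions rather than the paper's actual setup.

```python
# A hedged sketch of an LLM-as-a-judge call. The judge model, rubric, and
# prompt wording are assumptions; the paper's actual judging setup may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_RUBRIC = (
    "Rate the candidate introduction from 1 to 5 on faithfulness to the "
    "abstract, citation correctness against the related work, and narrative "
    "quality. Reply with a JSON object of scores."
)

def judge_introduction(abstract: str, related_work: str, candidate: str) -> str:
    """Ask a judge model (assumed here: gpt-4o) to grade a candidate introduction."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {
                "role": "user",
                "content": (
                    f"Abstract:\n{abstract}\n\n"
                    f"Related work:\n{related_work}\n\n"
                    f"Candidate introduction:\n{candidate}"
                ),
            },
        ],
    )
    return response.choices[0].message.content
```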
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate coherent research paper introductions
Assessing model performance across lexical, semantic, and narrative quality metrics
Developing effective prompting strategies for research writing assistance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scientific Introduction Generation task design
Multi-dimensional evaluation combining automated and LLM-judge metrics
Three-shot prompting strategy that consistently outperforms fewer-shot approaches (see the prompt-assembly sketch below)
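A minimal sketch of how such a three-shot prompt might be assembled from exemplar papers; the instruction wording and exemplar layout are illustrative assumptions, not the paper's template.

```python
# A minimal sketch of three-shot prompt assembly from exemplar papers.
# The instruction wording and exemplar layout are illustrative assumptions.
def build_three_shot_prompt(exemplars, title, abstract, related_work):
    """exemplars: three (title, abstract, related_work, introduction) tuples."""
    parts = [
        "Write the introduction of a research paper given its title, "
        "abstract, and related work.\n"
    ]
    for t, a, r, intro in exemplars[:3]:
        parts.append(
            f"Title: {t}\nAbstract: {a}\nRelated work: {r}\nIntroduction: {intro}\n"
        )
    parts.append(
        f"Title: {title}\nAbstract: {abstract}\n"
        f"Related work: {related_work}\nIntroduction:"
    )
    return "\n".join(parts)
```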