LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

📅 2024-09-03

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Current large language models (LLMs) show poor coherence and weak instruction adherence in long-text generation, especially for outputs of 16K tokens or more, which limits practical deployment.

Method: We introduce LongGenBench, the first benchmark dedicated to evaluating long-text generation capability, spanning four application scenarios, three instruction types, and two length tiers (16K and 32K tokens). It incorporates controllable generation tasks, including event triggering, constraint embedding, and structural control, and employs a hybrid human-automatic evaluation across coherence, instruction following, factual consistency, and length compliance.

Contribution/Results: Evaluated on ten state-of-the-art models, LongGenBench reveals an “understanding–generation gap”: models that score well on understanding-focused benchmarks (e.g., Ruler) consistently underperform on long-form generation, and performance deteriorates sharply as the target length increases. This points to a bottleneck in current LLMs’ long-text generation capacity.

📝 Abstract

Current benchmarks like Needle-in-a-Haystack (NIAH), Ruler, and Needlebench focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences, a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce LongGenBench, a novel benchmark designed to rigorously evaluate large language models' (LLMs) ability to generate long text while adhering to complex instructions. Through tasks requiring specific events or constraints within generated text, LongGenBench evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on Ruler, all models struggled with long text generation on LongGenBench, particularly as text length increased. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation.
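To make the evaluation grid and the idea of "specific events or constraints within generated text" more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the scenario and instruction-type labels are placeholders (the abstract gives only their counts), and `Constraint`, `config_grid`, and `adherence_rate` are hypothetical names introduced for illustration. The substring check stands in for whatever scoring the benchmark actually uses.

```python
from dataclasses import dataclass
from itertools import product

# Placeholder labels: the abstract states only the counts
# (4 scenarios, 3 instruction types, 2 lengths), not the names.
SCENARIOS = ["scenario_1", "scenario_2", "scenario_3", "scenario_4"]
INSTRUCTION_TYPES = ["type_1", "type_2", "type_3"]
TARGET_LENGTHS = [16_000, 32_000]  # target output length in tokens


def config_grid():
    """Enumerate every (scenario, instruction type, length) combination."""
    return list(product(SCENARIOS, INSTRUCTION_TYPES, TARGET_LENGTHS))


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint: a phrase that must appear in one section."""
    section: int  # 0-based index of the section that must contain the phrase
    phrase: str


def adherence_rate(sections, constraints):
    """Fraction of constraints whose phrase occurs in the required section."""
    if not constraints:
        return 1.0
    hits = sum(
        1
        for c in constraints
        if 0 <= c.section < len(sections)
        and c.phrase.lower() in sections[c.section].lower()
    )
    return hits / len(constraints)


# Invented toy data: three short "sections" and three required events.
sections = [
    "Week 1: went hiking with friends.",
    "Week 2: stayed home and read.",
    "Week 3: hosted a birthday party.",
]
constraints = [
    Constraint(0, "hiking"),
    Constraint(1, "concert"),   # missing from section 1, so it counts as a miss
    Constraint(2, "birthday"),
]
print(len(config_grid()), adherence_rate(sections, constraints))
```

Under these assumptions the grid has 4 × 3 × 2 = 24 configurations, and the toy output satisfies two of its three constraints. A real evaluation would need a more robust match than a case-insensitive substring test, since a model can describe a required event without using the exact phrase.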

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Coherent Generation

Long Text

Innovation

Methods, ideas, or system contributions that make the work stand out.

LongGenBench

Long Article Generation

Large Language Models Evaluation
