🤖 AI Summary
Large language models (LLMs) frequently under-generate, truncate prematurely, or refuse to comply with explicit length instructions (e.g., “generate a 10,000-word novel”), yet existing benchmarks lack systematic evaluation of length-following capability. Method: We introduce LIFEBench—the first dedicated benchmark for length instruction following—covering output lengths from 16 to 8,192 words, multiple tasks, and bilingual (English–Chinese) settings. It comprises 10,800 human-crafted instructions, with rigorous automated length verification and human evaluation. Contribution/Results: Evaluation across 26 LLMs reveals: (1) All mainstream models exhibit sharp performance degradation on long-output tasks; none achieve their vendor-specified maximum generation length. (2) Reasoning-oriented models significantly outperform long-text-specialized models. (3) Extended context window capacity does not improve length control accuracy. These findings expose a fundamental deficiency in controllable text generation, establishing a new evaluation paradigm and concrete directions for improvement.
📝 Abstract
While large language models (LLMs) can solve PhD-level reasoning problems over long-context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., "write a 10,000-word novel." In practice, models often generate outputs that are far too short, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce the Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8,192 words. We evaluate 26 widely used LLMs and find that most models reasonably follow short length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length-instruction following. Notably, reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length-instruction-following ability, offering critical insights for future progress.
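The automated length verification described above could, in its simplest form, compare an output's word count against the instructed target. The sketch below illustrates the idea; the function names, the whitespace tokenization, and the ±10% tolerance are illustrative assumptions, not the benchmark's actual metric (a real evaluator would also need language-aware counting, e.g., characters for Chinese):

```python
import re

def length_deviation(text: str, target_words: int) -> float:
    """Relative deviation of text's word count from the instructed target.

    Uses naive whitespace tokenization; returns 0.0 for an exact match,
    negative values for under-generation, positive for over-generation.
    """
    n_words = len(re.findall(r"\S+", text))
    return (n_words - target_words) / target_words

def follows_length(text: str, target_words: int, tol: float = 0.1) -> bool:
    """Judge an output compliant if its word count is within +/- tol of target."""
    return abs(length_deviation(text, target_words)) <= tol
```

Under such a metric, the under-generation the paper reports would surface as large negative deviations at long targets (e.g., a 2,000-word output against an 8,192-word instruction deviates by about -76%).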