On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the instability in output length that large language models exhibit during long-form generation, which undermines both generation consistency and practical reliability. The authors present the first systematic quantification of this issue, introducing VOLTBench, a heterogeneous-task benchmark, and probe the underlying mechanisms through attention trace analysis. Building on these insights, they propose GLoBo, a lightweight, training-free decoding strategy that improves length accuracy and stability via logits boosting. Experiments on VOLTBench show that GLoBo increases the base model's mean output length by 148% and reduces length volatility by 69%, while maintaining high generation quality.

📝 Abstract

Large Language Models (LLMs) excel at long-context understanding but exhibit significant limitations in long-form generation. Existing studies primarily focus on single-generation quality, generally overlooking the volatility of the output. This volatility not only leads to significant computational costs but also severely impacts the models' reliable application. To address this gap, our work unfolds in three stages: benchmarking, probing, and mitigation. We first propose the VOlatility in Long-form Text Benchmark (VOLTBench), a novel heterogeneous-task benchmark designed to systematically quantify the length volatility of long-form generation. Subsequently, by analyzing attention traces, we conduct an in-depth probe to identify several common internal patterns that cause this volatility. Finally, to mitigate long-form output volatility, we propose Stable Generation via Logits Boosting (GLoBo), a lightweight decoding-stage optimization strategy, designed to significantly enhance both the length accuracy and stability of long-form generation without additional training. Extensive experiments on VOLTBench provide the first systematic confirmation of severe long-form output instability in mainstream models and validate that our proposed method successfully improves the mean output length of the base model by 148% and reduces the length volatility by 69%, while maintaining high generation quality.
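
To make "length volatility" concrete, here is a minimal sketch of how one might measure it over repeated generations for the same prompt. The abstract does not state the metric VOLTBench uses, so the coefficient of variation (standard deviation divided by mean) is an assumption here, and the function names and word counts are hypothetical illustrations, not data from the paper.

```python
from statistics import mean, pstdev


def length_stats(lengths):
    """Return (mean, std, coefficient of variation) for a list of output lengths.

    Assumption: volatility is summarized as std / mean. The paper may
    define its metric differently.
    """
    if not lengths:
        raise ValueError("need at least one output length")
    mu = mean(lengths)
    sigma = pstdev(lengths)  # population standard deviation
    cv = sigma / mu if mu else float("inf")
    return mu, sigma, cv


def relative_change(before, after):
    """Fractional change from `before` to `after` (1.5 means +150%)."""
    return (after - before) / before


# Hypothetical word counts from four runs of the same prompt.
baseline = [400, 1200, 800, 1600]
stabilized = [2400, 2600, 2500, 2500]

base_mean, _, base_cv = length_stats(baseline)
stab_mean, _, stab_cv = length_stats(stabilized)

print(f"mean length change: {relative_change(base_mean, stab_mean):+.0%}")
print(f"volatility (CV) change: {relative_change(base_cv, stab_cv):+.0%}")
```

Reporting a normalized spread rather than raw variance keeps the number comparable across models whose mean lengths differ, which is one reason to prefer it when the mean itself is expected to shift.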

Problem

Research questions and friction points this paper is trying to address.

length volatility

long-form generation

output instability

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

length volatility

long-form generation

VOLTBench

logits boosting