Brevity is the soul of sustainability: Characterizing LLM response lengths

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) incur excessive energy consumption during inference due to response over-generation. Method: This work systematically defines six categories of information components in LLM outputs, identifying redundancy as arising primarily from unnecessary explanations, repetitive phrasing, and superfluous structural elements. Building on this analysis, we propose a lightweight, structured prompt-engineering paradigm that targets both response-length compression and information-density enhancement, without compromising accuracy, completeness, or readability. Contribution/Results: Extensive experiments across 12 mainstream decoder-only LLMs and 5 benchmark datasets demonstrate average response-length reductions of 35–60% and corresponding inference energy savings of 25–60%. The approach consistently outperforms existing response compression and model pruning techniques, jointly optimizing efficiency and output quality.

📝 Abstract
A significant portion of the energy consumed by Large Language Models (LLMs) arises from their inference processes; hence, developing energy-efficient inference methods is crucial. While several techniques exist for inference optimization, output compression remains relatively unexplored, with only a few preliminary efforts addressing this aspect. In this work, we first benchmark 12 decoder-only LLMs across 5 datasets, revealing that these models often produce responses that are substantially longer than necessary. We then conduct a comprehensive quality assessment of LLM responses, formally defining six information categories present in them. We show that LLMs often include redundant or additional information beyond the minimal answer. To address this issue of long responses, we explore several simple and intuitive prompt-engineering strategies. Empirical evaluation shows that appropriate prompts targeting length reduction and controlling information content can achieve significant energy savings of 25–60% by reducing response length while preserving the quality of LLM responses.
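The link between length reduction and energy savings can be illustrated with a back-of-envelope sketch: if inference energy scales roughly linearly with the number of generated tokens, a shorter response translates almost directly into energy saved. This is a simplifying assumption for illustration, not the paper's measurement methodology, and the token counts below are invented.

```python
# Illustrative sketch (not from the paper): assume inference energy is
# roughly proportional to the number of generated output tokens.

def energy_saving(baseline_tokens: int, compressed_tokens: int) -> float:
    """Fractional energy saved under a linear energy-per-token assumption."""
    return 1.0 - compressed_tokens / baseline_tokens

# A 40% shorter response then implies roughly 40% less inference energy.
print(f"{energy_saving(200, 120):.0%}")  # -> 40%
```

The paper's reported numbers (35–60% length reduction versus 25–60% energy savings) suggest the real relationship is close to, but not exactly, linear.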
Problem

Research questions and friction points this paper is trying to address.

Energy-efficient inference in Large Language Models
Reducing redundant information in LLM responses
Prompt-engineering for shorter, quality-preserving outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking 12 LLMs across 5 datasets
Defining six information categories in responses
Prompt-engineering for energy-efficient responses
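The prompt-engineering idea above can be sketched as a thin wrapper that prepends a length-constraining instruction to the user's question, steering the model toward the minimal answer. The instruction wording and function names below are hypothetical illustrations, not the paper's actual prompts.

```python
# Hypothetical sketch of a length-reduction prompt wrapper. The prefix text
# is an invented example of the kind of instruction the paper studies.

CONCISE_PREFIX = (
    "Answer with only the minimal answer. Do not add explanations, "
    "restatements, examples, or structural elements such as lists."
)

def make_concise_prompt(question: str) -> str:
    """Wrap a question with an instruction targeting response-length reduction."""
    return f"{CONCISE_PREFIX}\n\nQuestion: {question}"

prompt = make_concise_prompt("What is the capital of France?")
print(prompt)
```

In practice such a prompt would be sent as the input to any decoder-only LLM; the paper's claim is that well-chosen instructions of this kind cut response length without degrading answer quality.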