Rethinking Perplexity: Revealing the Impact of Input Length on Perplexity Evaluation in LLMs

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional perplexity evaluation becomes unreliable in long-context settings, and the impact of input length on the fairness and efficiency of language-model assessment has not been systematically studied. This work proposes LengthBenchmark, a framework that treats input length as a first-class variable and jointly evaluates predictive performance and system overhead, including latency, memory usage, and computational cost. The framework introduces two scoring protocols, direct accumulation and sliding window, and uses quantized models as robustness checks. Experiments show that the sliding-window protocol inflates performance on short inputs, and that increasing the evaluated segment length consistently improves apparent performance for both full-precision and quantized models. These findings point to a pervasive length bias that compromises fair model comparison.

📝 Abstract
Perplexity is a widely adopted metric for assessing the predictive quality of large language models (LLMs) and often serves as a reference metric for downstream evaluations. However, recent evidence shows that perplexity can be unreliable, especially when long, irrelevant inputs are used, raising concerns for both benchmarking and system deployment. While prior efforts have employed selective input filtering and curated datasets, the impact of input length on perplexity has not been systematically studied from a systems perspective, and input length has rarely been treated as a first-class system variable affecting both fairness and efficiency. In this work, we close this gap by introducing LengthBenchmark, a system-conscious evaluation framework that explicitly integrates input length, evaluation protocol design, and system-level costs. It evaluates representative LLMs under two scoring protocols (direct accumulation and fixed-window sliding) across varying context lengths. Unlike prior work that focuses solely on accuracy-oriented metrics, LengthBenchmark additionally measures latency, memory footprint, and evaluation cost, thereby linking predictive metrics to deployment realities. We further incorporate quantized variants, not as a main contribution but as robustness checks, showing that length-induced biases persist across both full-precision and compressed models. This design disentangles the effects of evaluation logic, quantization, and input length, and demonstrates that length bias is a general phenomenon that undermines fair cross-model comparison. Our analysis yields two key observations: (i) sliding-window evaluation consistently inflates performance on short inputs, and (ii) both full-precision and quantized models appear to realise gains as the evaluated segment length grows.
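The abstract names two scoring protocols but does not spell out their mechanics. A minimal, self-contained sketch of the usual distinction, assuming the standard formulations (the function names and the constant log-prob scorer are illustrative, not from the paper): direct accumulation scores non-overlapping segments, so the first tokens of each segment get little or no context, whereas a sliding window re-scores with overlapping context so every token keeps up to `window - 1` tokens of history.

```python
import math

def direct_accumulation(tokens, seg_len):
    """Score each token once, conditioned only on earlier tokens of its
    own non-overlapping segment (the first token of every segment is
    therefore scored with no context at all)."""
    pairs = []
    for s in range(0, len(tokens), seg_len):
        seg = tokens[s:s + seg_len]
        for i in range(len(seg)):
            pairs.append((seg[:i], seg[i]))
    return pairs

def sliding_window(tokens, window, stride):
    """Windows advance by `stride` and overlap; each token is still
    scored exactly once, but with up to `window - 1` tokens of context
    carried over from the previous window."""
    pairs = []
    scored = 0  # number of tokens already scored
    for start in range(0, len(tokens), stride):
        end = min(start + window, len(tokens))
        for i in range(scored, end):
            ctx = tokens[max(0, i - window + 1):i]
            pairs.append((ctx, tokens[i]))
        scored = end
        if end == len(tokens):
            break
    return pairs

def perplexity(pairs, logprob_fn):
    """Perplexity = exp of the negative mean token log-probability."""
    total = sum(logprob_fn(ctx, tgt) for ctx, tgt in pairs)
    return math.exp(-total / len(pairs))

# With 8 tokens: under direct accumulation (seg_len=4) the token at
# position 4 starts a new segment and sees no context, while under a
# sliding window (window=4, stride=2) it keeps 3 tokens of context.
```

This context asymmetry is the mechanical source of the length effects the paper studies: tokens scored with longer context are easier to predict, so the protocol and segment length directly shift the measured perplexity.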
Problem

Research questions and friction points this paper is trying to address.

perplexity
input length
large language models
evaluation bias
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

perplexity
input length
LengthBenchmark
evaluation framework
system-aware evaluation