A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lexical diversity metrics (e.g., MATTR) are susceptible to prompt-induced length variation when evaluating LLM-generated text, systematically underestimating diversity in longer responses. To address this, we propose PATTR—the first diversity metric that explicitly incorporates task-specific target length for bias correction, jointly optimizing accuracy in diversity assessment and adherence to desired output length. Evaluated on over 20 million tokens of video script data generated by seven models across the LLaMA, OLMo, and Phi families, PATTR demonstrates significantly improved detection of high-diversity outputs. It maintains strict length consistency while matching or exceeding the diversity evaluation performance of baselines—including MATTR and compression ratio—across diverse model architectures. By decoupling diversity from spurious length correlations, PATTR provides a more robust, interpretable, and task-aligned quantitative tool for assessing LLM text quality.

📝 Abstract
Synthetic text generated by Large Language Models (LLMs) is increasingly used for further training and improvement of LLMs. Diversity is crucial for the effectiveness of synthetic data, and researchers rely on prompt engineering to improve diversity. However, the impact of prompt variations on response text length, and, more importantly, the consequential effect on lexical diversity measurements, remain underexplored. In this work, we propose Penalty-Adjusted Type-Token Ratio (PATTR), a diversity metric robust to length variations. We generate a large synthetic corpus of over 20M words using seven models from the LLaMA, OLMo, and Phi families, focusing on a creative writing task of video script generation, where diversity is crucial. We evaluate per-response lexical diversity using PATTR and compare it against existing metrics of Moving-Average TTR (MATTR) and Compression Ratio (CR). Our analysis highlights how text length variations introduce biases favoring shorter responses. Unlike existing metrics, PATTR explicitly considers the task-specific target response length ($L_T$) to effectively mitigate length biases. We further demonstrate the utility of PATTR in filtering the top-10/100/1,000 most lexically diverse responses, showing that it consistently outperforms MATTR and CR by yielding on par or better diversity with high adherence to $L_T$.
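The abstract describes PATTR only at a high level: a type-token ratio adjusted by a penalty tied to the task-specific target length $L_T$. The paper's exact formula is not reproduced on this page, so the following is a minimal illustrative sketch, assuming a simple multiplicative penalty that equals 1 when the response length matches $L_T$ and shrinks as the length deviates in either direction; the function name and penalty form are hypothetical.

```python
import re

def pattr_sketch(text: str, target_length: int) -> float:
    """Hypothetical sketch of a Penalty-Adjusted Type-Token Ratio.

    Combines a plain type-token ratio (unique tokens / total tokens)
    with a multiplicative penalty min(n, L_T) / max(n, L_T) that
    penalizes responses whose length n deviates from the target L_T.
    The paper's actual formulation may differ.
    """
    tokens = re.findall(r"\w+", text.lower())
    n = len(tokens)
    if n == 0:
        return 0.0
    ttr = len(set(tokens)) / n                                # lexical diversity
    penalty = min(n, target_length) / max(n, target_length)   # in (0, 1], 1 iff n == L_T
    return ttr * penalty
```

Under this sketch, a response at exactly the target length with no repeated tokens scores 1.0, while a response half the target length is halved even if fully diverse, which mirrors the paper's goal of rewarding diversity only alongside adherence to $L_T$.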
Problem

Research questions and friction points this paper is trying to address.

Measure lexical diversity in synthetic texts under length variations
Assess impact of prompt variations on text length and diversity
Propose robust diversity metric to mitigate length bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Penalty-Adjusted Type-Token Ratio (PATTR)
Generates 20M-word corpus using seven LLMs
PATTR mitigates length biases in diversity metrics