Evaluating the Diversity and Quality of LLM Generated Content

📅 2025-04-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Preference tuning methods (e.g., RLHF, DPO) often reduce generative diversity in LLMs, yet real-world applications require balancing output quality and diversity. Method: We propose “effective semantic diversity”—the semantic dissimilarity among outputs that meet a minimum quality threshold—and introduce an unsupervised evaluation framework integrating semantic similarity and automated quality scoring to systematically compare SFT, PPO, GRPO, and DPO across open-ended tasks. Results: (1) Preference tuning reduces surface-level diversity but significantly increases the proportion of high-quality outputs and effective semantic diversity; (2) surface-form diversity and content-level diversity are fundamentally decoupled; (3) smaller models generate more unique, high-quality content per parameter under fixed sampling. This work provides an interpretable, practical evaluation benchmark for creative assistance and synthetic data generation.
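The metric above can be sketched in a few lines. The paper's exact quality scorer and similarity model are not specified in this summary, so the sketch below is a minimal illustration assuming precomputed sentence embeddings and automated quality scores; the function name and cosine-dissimilarity choice are assumptions, not the authors' implementation.

```python
from itertools import combinations
from math import sqrt

def cosine_dissimilarity(u, v):
    # 1 - cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def effective_semantic_diversity(embeddings, qualities, threshold):
    """Mean pairwise semantic dissimilarity among outputs whose
    quality score meets the threshold; 0.0 if fewer than two qualify."""
    kept = [e for e, q in zip(embeddings, qualities) if q >= threshold]
    if len(kept) < 2:
        return 0.0
    pairs = list(combinations(kept, 2))
    return sum(cosine_dissimilarity(u, v) for u, v in pairs) / len(pairs)
```

This makes the key property concrete: a model whose low-quality outputs are filtered out can score higher than a more lexically varied model, because only above-threshold outputs contribute to the diversity average.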

📝 Abstract
Recent work suggests that preference-tuning techniques--including Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO--reduce diversity, creating a dilemma given that such models are widely deployed in applications requiring diverse outputs. To address this, we introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds--which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: although preference-tuned models--especially those trained via RL--exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models, not from increasing diversity among high-quality outputs, but from generating more high-quality outputs overall. We discover that preference tuning reduces syntactic diversity while preserving semantic diversity--revealing a distinction between diversity in form and diversity in content that traditional metrics often overlook. Our analysis further shows that smaller models are consistently more parameter-efficient at generating unique content within a fixed sampling budget, offering insights into the relationship between model scaling and diversity. These findings have important implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
Problem

Research questions and friction points this paper is trying to address.

Measure effective semantic diversity in LLM outputs
Assess impact of preference-tuning on diversity and quality
Explore parameter-efficiency in generating diverse content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised framework for measuring effective semantic diversity
Finding: preference tuning reduces syntactic diversity while preserving semantic diversity
Finding: smaller models are more parameter-efficient at generating unique, high-quality content