Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the miscalibration of output diversity in large language models (LLMs) for open-ended generation: outputs that are too homogeneous on creative tasks and too diverse, and therefore hallucination-prone, on factual tasks. It unifies both failure modes under the notion of effective generation space size (GSS), the set of semantically distinct outputs a model considers for a prompt. The authors construct GSSBench, a benchmark of prompt pairs with ground-truth GSS relationships, and find that hallucination detection metrics computed from model internals, particularly EigenScore, track GSS more reliably than standard diversity and uncertainty quantification metrics while remaining interpretable. They demonstrate three applications: detecting ambiguous prompts that warrant clarification questions, interpreting overthinking and underthinking in reasoning models, and steering models to expand their generation space toward high-quality, diverse outputs.
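The summary singles out EigenScore, an internal-representation metric from the hallucination-detection literature that scores how widely K sampled responses spread in embedding space. As a concrete reference point, here is a minimal Python sketch of an EigenScore-style measure; the embedding source (e.g., mean-pooled mid-layer hidden states), the ridge term alpha, and the normalization are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """EigenScore-style spread measure over K sampled responses.

    embeddings: (K, d) array with one sentence embedding per sample,
    e.g., mean-pooled hidden states from a middle layer (an assumption
    here, not necessarily the paper's choice). Higher scores mean the
    samples are semantically spread out (a large effective generation
    space); near-duplicate samples drive the score strongly negative.
    """
    K = embeddings.shape[0]
    # Center so the score reflects spread around the mean, not location.
    Z = embeddings - embeddings.mean(axis=0, keepdims=True)
    # K x K covariance (Gram) matrix of the centered samples; the ridge
    # term alpha keeps the log-determinant finite when samples repeat.
    gram = Z @ Z.T / Z.shape[1] + alpha * np.eye(K)
    # Mean log-eigenvalue equals (1/K) * log det(gram).
    return float(np.mean(np.log(np.linalg.eigvalsh(gram))))
```

In use, one would sample K responses to a prompt, embed each, and compare scores across prompts; a creative prompt should score higher than a single-answer factual one.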

📝 Abstract
Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model's internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.
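GSSBench evaluates metrics on prompt pairs with a known GSS ordering, so the natural headline number is pairwise ranking accuracy. The sketch below shows that evaluation loop under assumed inputs: the pair format and the example prompts are illustrative, not the benchmark's actual data.

```python
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(
    pairs: Iterable[Tuple[str, str]],
    metric: Callable[[str], float],
) -> float:
    """Fraction of pairs a GSS metric orders correctly.

    Each pair is (larger, smaller): the first prompt should admit the
    larger generation space. `metric` would typically sample several
    generations per prompt, embed them, and score their spread.
    """
    pairs = list(pairs)
    correct = sum(metric(larger) > metric(smaller) for larger, smaller in pairs)
    return correct / len(pairs)

# Illustrative pair: the open-ended prompt should get the higher score.
example_pairs = [
    ("Write a short story set in a lighthouse.",
     "What is the capital of France?"),
]
```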
Problem

Research questions and friction points this paper is trying to address.

Calibrating LLM output diversity across different tasks
Unifying mode collapse and hallucination via generation space size
Detecting prompt ambiguity and improving model grounding (a sketch follows this list)
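One way the ambiguity-detection idea could work in practice: if sampled answers to a prompt spread widely in embedding space, the model is entertaining several distinct readings, and a clarification question may help grounding. A hedged sketch, where `sample`, `embed`, and the threshold are hypothetical stand-ins rather than the paper's calibrated decision rule:

```python
import numpy as np

def should_clarify(prompt: str, sample, embed, k: int = 8,
                   min_spread: float = 0.35) -> bool:
    """Flag a prompt as ambiguous when its sampled answers are spread out.

    sample: prompt -> str   (hypothetical stochastic generator)
    embed:  str -> ndarray  (hypothetical sentence embedder)
    min_spread is an assumed cutoff on mean pairwise cosine distance.
    """
    Z = np.stack([embed(sample(prompt)) for _ in range(k)])
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)    # unit-normalize
    sims = Z @ Z.T
    spread = 1.0 - sims[~np.eye(k, dtype=bool)].mean()  # mean cosine distance
    return spread > min_spread
```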
Innovation

Methods, ideas, or system contributions that make the work stand out.

Measuring effective generation space size for model calibration
Using hallucination detection metrics (e.g., EigenScore) to probe internal task representations
Steering models to expand generation space diversity (a baseline sketch follows this list)
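The paper steers models to expand their generation space; the exact mechanism is not described in this summary, so the sketch below is only a rejection-sampling baseline in the same spirit: resample and keep a candidate only when its embedding is far enough from everything already kept. `generate` and `embed` are hypothetical stand-ins for a stochastic sampler and a sentence embedder.

```python
import numpy as np

def expand_generation_space(generate, embed, prompt: str,
                            k: int = 8, max_sim: float = 0.9) -> list:
    """Collect up to k semantically distinct outputs for a prompt.

    generate: prompt -> str  (hypothetical stochastic sampler)
    embed:    str -> ndarray (hypothetical sentence embedder)
    A candidate is kept only if its cosine similarity to every kept
    output stays below max_sim (an assumed cutoff).
    """
    kept, kept_embs = [], []
    for _ in range(4 * k):          # fixed sampling budget
        if len(kept) == k:
            break
        text = generate(prompt)
        z = embed(text)
        z = z / np.linalg.norm(z)
        if all(float(z @ e) < max_sim for e in kept_embs):
            kept.append(text)
            kept_embs.append(z)
    return kept
```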
Sunny Yu
Department of Computer Science, Stanford University
Ahmad Jabbar
Department of Linguistics, Stanford University
Robert Hawkins
Department of Linguistics, Stanford University
Dan Jurafsky
Professor of Linguistics and Computer Science, Stanford University
Natural Language Processing, Speech Recognition, Computational Linguistics, Linguistics, Computational Social Science
Myra Cheng
Stanford