🤖 AI Summary
This work identifies a “diversity collapse” phenomenon in instruction-tuned large language models (LLMs), in which rigid structural templates, such as role markers and special tokens, lead the model to produce semantically similar outputs for open-ended prompts, limiting creativity and output variability. We show that the effect persists even under high-temperature sampling, which points to structural tokens rather than sampling parameters such as temperature as the main driver. To examine the trade-off between format fidelity and generative diversity, we fine-tune the same model with a range of structured prompts and evaluate the resulting models on downstream task performance, alignment behavior, and output diversity. Format consistency between fine-tuning and inference proves important for structure-sensitive benchmarks (GSM8K, IFEval) but has only marginal influence on knowledge-heavy ones (MMLU, WebQuestions). Output diversity, in contrast, depends mainly on whether structural tokens are present, with minimal formatting yielding the most diverse outputs.
📝 Abstract
Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model's output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate the resulting models across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.
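
To make "output diversity" concrete, the following is a minimal sketch of how the diversity of a set of sampled completions could be scored. The abstract does not state which metric the paper uses, so both functions here (`distinct_n` and `mean_pairwise_jaccard_distance`) are illustrative assumptions rather than the authors' method. They are also purely lexical proxies, whereas the abstract speaks of semantic similarity, for which an embedding-based measure would likely be a closer fit. The sample strings are made up for illustration and are not model outputs.

```python
from itertools import combinations


def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams pooled over the samples.

    Values near 1.0 mean few repeated n-grams; values near 0.0 mean
    the samples largely repeat each other.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def mean_pairwise_jaccard_distance(texts):
    """Average (1 - Jaccard similarity) of token sets over all sample pairs."""
    token_sets = [set(text.lower().split()) for text in texts]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        similarity = len(a & b) / len(union) if union else 1.0
        total += 1.0 - similarity
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical completions for one open-ended prompt (not real data).
    collapsed = [
        "once upon a time there was a brave knight",
        "once upon a time there was a brave girl",
        "once upon a time there was a brave king",
    ]
    varied = [
        "the lighthouse keeper counted ships until dawn",
        "rain hammered the tin roof of the old bakery",
        "a fox slipped quietly through the orchard gate",
    ]
    for name, samples in [("collapsed", collapsed), ("varied", varied)]:
        print(name, distinct_n(samples), mean_pairwise_jaccard_distance(samples))
```

In an experiment along the lines the abstract describes, one would sample many completions per prompt under each prompt format (full template versus minimal formatting) and compare such scores across formats at a fixed temperature.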