Where does output diversity collapse in post-training?

📅 2026-04-17

🤖 AI Summary

This study investigates the loss of output diversity that commonly follows post-training of language models, which hurts inference-time scaling methods that depend on varied samples as well as creative and value-laden tasks. By tracing Olmo 3 through three parallel post-training lineages (Think, Instruct, and RL-Zero), the work separates the contribution of training data composition from that of the training method, and the effect of the generation format from that of the model weights. Using four text diversity metrics across 15 tasks, it finds that the Think lineage loses most of its semantic diversity at supervised fine-tuning (SFT), while DPO has a larger effect in Instruct than in Think. Suppressing chain-of-thought at inference in Think models lowers accuracy on hard tasks but leaves answer-level diversity unchanged, indicating that the collapse is embedded in the weights by the training data rather than imposed by the generation format. On six verifiable tasks, the split between removal of incorrect outputs and genuine narrowing among correct outputs is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. The authors conclude that diversity collapse cannot be addressed at inference time alone.

📝 Abstract

Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
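
The decomposition into a quality-control component and a residual component can be illustrated with a small sketch. It is not the paper's formulation, which the abstract does not spell out. It assumes diversity is measured as Shannon entropy over distinct sampled answers, that the residual is the entropy drop among correct answers only, and that the quality-control component is whatever remains of the total drop. The names `entropy`, `decompose_loss`, and `is_correct` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over distinct outputs.

    Assumption: a stand-in for a diversity metric, not one of the paper's four.
    """
    if not samples:
        return 0.0
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def decompose_loss(base, post, is_correct):
    """Split the base-to-post diversity drop into two parts (illustrative only).

    residual        = drop in diversity among correct outputs only
    quality_control = remainder of the total drop, attributed here to the
                      removal of incorrect outputs
    """
    base_correct = [s for s in base if is_correct(s)]
    post_correct = [s for s in post if is_correct(s)]
    total = entropy(base) - entropy(post)
    residual = entropy(base_correct) - entropy(post_correct)
    return {
        "total": total,
        "quality_control": total - residual,
        "residual": residual,
    }


if __name__ == "__main__":
    # Toy samples: the base model gives four distinct answers, two of them correct;
    # the post-trained model always gives the same correct answer.
    base_samples = ["a", "b", "c", "d"]
    post_samples = ["a", "a", "a", "a"]
    print(decompose_loss(base_samples, post_samples, lambda s: s in {"a", "b"}))
```

In this toy case half of the drop comes from discarding the incorrect answers and half from narrowing among the correct ones. Under these assumptions, the task-dependent split reported in the abstract would correspond to this ratio varying from task to task.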

Problem

Research questions and friction points this paper is trying to address.

output diversity collapse

post-training

training data composition

language models

inference-time scaling

Innovation

Methods, ideas, or system contributions that make the work stand out.

output diversity collapse

post-training

data composition