🤖 AI Summary
Current preference learning methods for creative writing overemphasize output quality while neglecting diversity during LLM post-training. To address this limitation, we propose a preference learning framework that jointly optimizes diversity and quality. Our key innovation is the first explicit modeling of sample-level deviation (quantifying how much a generated sample diverges from dominant modes) as a distinct optimization objective, thereby decoupling diversity and quality control. Building upon DPO and ORPO, we introduce a deviation-aware loss that enhances model learning from rare yet high-quality samples. Experiments on an 8B-parameter model demonstrate substantial improvements in output diversity, matching that of a human-written dataset, while preserving high quality approaching that of GPT-4o and DeepSeek-R1. Our method consistently outperforms baselines, including DivPO. Ablation studies and human evaluations confirm the effectiveness and generalizability of deviation modeling.
📝 Abstract
As creative writing tasks do not have singular correct answers, large language models (LLMs) trained to perform these tasks should be able to generate diverse valid outputs. However, LLM post-training often focuses on improving generation quality but neglects to facilitate output diversity. Hence, for creative writing generation, we investigate post-training approaches that promote both output diversity and quality. Our core idea is to include deviation -- the degree of difference between a training sample and all other samples with the same prompt -- in the training objective to facilitate learning from rare high-quality instances. By applying our approach to direct preference optimization (DPO) and odds ratio preference optimization (ORPO), we demonstrate that we can promote the output diversity of trained models while minimally decreasing quality. Our best 8B-parameter model achieves diversity on par with a human-created dataset while maintaining output quality similar to that of the best instruction-tuned models we examined, GPT-4o and DeepSeek-R1. We further validate our approach with a human evaluation, an ablation study, and a comparison to an existing diversification approach, DivPO.
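To make the core idea concrete, here is a minimal sketch of how a deviation term could enter a DPO-style objective. All specifics are illustrative assumptions, not the paper's actual formulation: deviation is approximated as mean pairwise Jaccard dissimilarity among samples for the same prompt, and it enters as a simple multiplicative weight `(1 + alpha * deviation)` on the per-pair DPO loss, up-weighting pairs whose chosen sample is rarer.

```python
import math

def jaccard_distance(a, b):
    """1 - |A∩B|/|A∪B| over word sets; a crude stand-in for a
    semantic dissimilarity measure (an assumption, not the paper's metric)."""
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def deviation(sample, others):
    """Mean pairwise dissimilarity of `sample` from all other samples
    generated for the same prompt -- a sketch of the paper's 'deviation'."""
    return sum(jaccard_distance(sample, o) for o in others) / len(others)

def dpo_loss(logp_chosen, logp_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid of the
    beta-scaled margin between policy and reference log-ratios."""
    margin = beta * ((logp_chosen - logp_ref_chosen)
                     - (logp_rejected - logp_ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def deviation_aware_dpo_loss(pair, dev_chosen, alpha=1.0, **kw):
    """Hypothetical deviation-aware variant: scale the DPO loss by
    (1 + alpha * deviation of the chosen sample), so rare high-quality
    samples contribute a larger gradient. Illustrative only."""
    return (1.0 + alpha * dev_chosen) * dpo_loss(*pair, **kw)

# Usage: the outlier sample gets a higher deviation score,
# so a pair that prefers it receives a larger training weight.
samples = ["the cat sat", "the cat slept", "quantum frogs sing"]
dev = deviation(samples[2], samples[:2])          # high: distinct wording
loss = deviation_aware_dpo_loss((-1.0, -2.0, -1.5, -1.5), dev_chosen=dev)
```

The same weighting could in principle be attached to the ORPO odds-ratio term; the key design point is that deviation is computed per sample against its sibling generations, decoupling the diversity signal from the quality (preference) signal.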