Modifying Large Language Model Post-Training for Diverse Creative Writing

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current preference learning methods for creative writing overemphasize output quality while neglecting diversity during LLM post-training. To address this limitation, we propose a preference learning framework that jointly optimizes diversity and quality. Our key innovation is the first explicit modeling of sample-level deviation—quantifying how much a generated sample diverges from dominant modes—as a distinct optimization objective, thereby decoupling diversity and quality control. Building upon DPO and ORPO, we introduce a deviation-aware loss that enhances model learning from rare yet high-quality samples. Experiments on an 8B-parameter model demonstrate substantial improvements in output diversity—matching human-written datasets—while preserving high quality, approaching that of GPT-4o and DeepSeek-R1. Our method consistently outperforms baselines including DivPO. Ablation studies and human evaluations confirm the effectiveness and generalizability of deviation modeling.

📝 Abstract
As creative writing tasks do not have singular correct answers, large language models (LLMs) trained to perform these tasks should be able to generate diverse valid outputs. However, LLM post-training often focuses on improving generation quality while neglecting output diversity. Hence, in creative writing generation, we investigate post-training approaches that promote both output diversity and quality. Our core idea is to include deviation -- the degree of difference between a training sample and all other samples with the same prompt -- in the training objective to facilitate learning from rare high-quality instances. By applying our approach to direct preference optimization (DPO) and odds ratio preference optimization (ORPO), we demonstrate that we can promote the output diversity of trained models while minimally decreasing quality. Our best model, with 8B parameters, achieves diversity on par with a human-created dataset while producing output quality similar to the best instruction-tuned models we examined, GPT-4o and DeepSeek-R1. We further validate our approaches with a human evaluation, an ablation, and a comparison to an existing diversification approach, DivPO.
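The core idea from the abstract (measure each sample's deviation from other generations for the same prompt, then fold that deviation into a DPO-style preference loss) can be sketched as follows. This is an illustrative sketch only: the cosine-distance deviation measure, the function names, and the linear `(1 + alpha * dev)` weighting are assumptions for exposition, not the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on log-probabilities:
    -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(sigmoid(margin))

def deviation(sample, others):
    """Deviation of one sample from the other generations for the same
    prompt: mean cosine distance between embedding vectors. The distance
    measure is a stand-in assumption, not necessarily the paper's."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm
    return sum(1.0 - cosine(sample, o) for o in others) / len(others)

def deviation_weighted_dpo(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                           dev, beta=0.1, alpha=1.0):
    """Deviation-aware loss: scale the DPO loss so that rare
    (high-deviation) chosen samples contribute more to training.
    The linear (1 + alpha * dev) weighting is an illustrative choice."""
    return (1.0 + alpha * dev) * dpo_loss(pi_chosen, pi_rejected,
                                          ref_chosen, ref_rejected, beta)
```

The same weighting could be attached to the ORPO odds-ratio term instead; the key design choice in both cases is that diversity enters through a per-sample weight, decoupled from the quality signal carried by the preference pair.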
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM output diversity in creative writing tasks
Balancing diversity and quality during LLM post-training
Incorporating a deviation metric to learn from rare high-quality instances
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates deviation into the training objective
Extends DPO and ORPO to promote diversity
Balances diversity and quality effectively