RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Creative writing requires balancing subjective quality (e.g., literary merit, emotional expressiveness) with objective constraints (e.g., format, length), yet existing reinforcement learning (RL) approaches struggle to dynamically reconcile these competing objectives. To address this, we propose an online RL framework with dynamic hybrid reward modeling—the first to jointly incorporate subjective preferences (assessed by a writing reward model) and objective constraint satisfaction (verified by a constraint detection model) during training. Crucially, the framework adaptively modulates the weight of constraint rewards based on intra-batch sample quality, enabling precise penalization of violations. Implemented via GRPO for end-to-end optimization, our method is validated across 8B–72B language models. Results show a 3.29-percentage-point improvement in instruction adherence (83.36% → 86.65%) and a 72.75% win rate in human evaluations. We further introduce WriteEval—the first comprehensive benchmark tailored to realistic creative writing scenarios.

📝 Abstract
Large language models are widely used in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing reinforcement learning methods struggle to balance these two aspects: single-reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods cannot adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), which mixes rewards dynamically from a writing reward model that evaluates subjective writing quality and a constraint verification model that assesses objective constraint following. The constraint-following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that constraint-violating samples receive a negative advantage in GRPO and are thus penalized during training; this is the key innovation of the method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results show that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
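Neither the summary nor the abstract spells out the weighting formula, so the minimal Python sketch below shows one plausible reading of the mechanism: within each GRPO sampling group, the constraint penalty is scaled to the in-group spread of writing-quality scores so that every constraint-violating sample ends up with a mixed reward below the group mean, and hence a negative group-relative advantage. The function name, both model interfaces, and the specific weighting rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mixed_rewards(writing_scores, constraint_ok, base_weight=1.0, eps=1e-6):
    """One plausible reading of RLMR's dynamic reward mixing for a GRPO group.

    writing_scores: per-sample subjective scores from a writing reward model
                    (hypothetical interface).
    constraint_ok:  per-sample booleans from a constraint verification model
                    (hypothetical interface); True means all constraints hold.
    """
    s = np.asarray(writing_scores, dtype=float)
    ok = np.asarray(constraint_ok, dtype=bool)
    viol_frac = 1.0 - ok.mean()  # fraction of the group violating constraints

    if 0.0 < viol_frac < 1.0:
        # Scale the penalty to the in-group quality spread so that even the
        # best-written violator's mixed reward falls below the group mean,
        # which makes its GRPO advantage negative.
        weight = (s.max() - s.mean() + eps) / (1.0 - viol_frac)
        weight = max(weight, base_weight)
    else:
        # All samples comply (nothing to penalize) or all violate
        # (a uniform shift leaves group-relative advantages unchanged).
        weight = base_weight

    # Subtract the penalty only from constraint-violating samples.
    return s - weight * (~ok).astype(float)
```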
Problem

Research questions and friction points this paper is trying to address.

Balancing subjective writing quality with objective constraints in creative writing
Adapting reward weights dynamically for different writing scenarios
Improving both instruction following and writing quality simultaneously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically mixed reward system balances subjective writing quality against objective constraint following
Constraint reward weight adjusted according to the writing quality of samples within each group
Negative advantage penalizes constraint violations in GRPO training (illustrated in the sketch below)
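To make the third point concrete: GRPO computes each sample's advantage relative to its own sampling group (a z-score within the group), so once the mixed penalty drops a violator's reward below the group mean, its advantage is negative and the policy update suppresses it. The toy scores and the penalty weight below are invented for illustration.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO's group-relative advantage: z-score of each reward in its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy group of four completions. Sample 1 has the best writing score (0.90)
# but violates a constraint; after subtracting an illustrative penalty of 1.2,
# its reward sits below the group mean, so its advantage comes out negative
# while the compliant samples' advantages stay positive.
mixed = [0.62, 0.90 - 1.2, 0.55, 0.70]
print(grpo_advantages(mixed))
```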
Authors
Jianxing Liao
Tencent Hunyuan Team
Tian Zhang
Tencent Hunyuan Team
Xiao Feng
Tencent Hunyuan Team
Yusong Zhang
Tencent Hunyuan Team
Rui Yang
Tencent Hunyuan Team
Haorui Wang
PhD student, Gatech
Machine Learning, Large Language Models, Decision Making, Uncertainty Quantification
Bosi Wen
Tsinghua University
Natural Language Processing
Ziying Wang
Peking University
Runzhi Shi
Peking University