RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Creative writing requires balancing subjective quality (e.g., literary merit, emotional expressiveness) with objective constraints (e.g., format, length), yet existing reinforcement learning (RL) approaches struggle to dynamically reconcile these competing objectives. To address this, we propose an online RL framework with dynamic hybrid reward modeling—the first to jointly incorporate subjective preferences (assessed by a writing reward model) and objective constraint satisfaction (verified by a constraint detection model) during training. Crucially, the framework adaptively modulates the weight of constraint rewards based on intra-batch sample quality, enabling precise penalization of violations. Implemented via GRPO for end-to-end optimization, our method is validated across 8B–72B language models. Results show a 3.29-percentage-point improvement in instruction adherence (83.36% → 86.65%) and a 72.75% win rate in human evaluations. We further introduce WriteEval—the first comprehensive benchmark tailored to realistic creative writing scenarios.

📝 Abstract
Large language models are widely used in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing reinforcement learning methods struggle to balance these two aspects: single-reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods cannot adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), which mixes rewards dynamically from a writing reward model that evaluates subjective writing quality and a constraint verification model that assesses objective constraint following. The constraint-following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that constraint-violating samples receive a negative advantage in GRPO and are thus penalized during training; this is the key innovation of the method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results show that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
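Neither the summary nor the abstract spells out the weighting formula, so the minimal Python sketch below shows one plausible reading of the mechanism: within each GRPO sampling group, the constraint penalty is scaled to the in-group spread of writing-quality scores so that every constraint-violating sample ends up with a mixed reward below the group mean, and hence a negative group-relative advantage. The function name, both model interfaces, and the specific weighting rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mixed_rewards(writing_scores, constraint_ok, base_weight=1.0, eps=1e-6):
    """One plausible reading of RLMR's dynamic reward mixing for a GRPO group.

    writing_scores: per-sample subjective scores from a writing reward model
                    (hypothetical interface).
    constraint_ok:  per-sample booleans from a constraint verification model
                    (hypothetical interface); True means all constraints hold.
    """
    s = np.asarray(writing_scores, dtype=float)
    ok = np.asarray(constraint_ok, dtype=bool)
    viol_frac = 1.0 - ok.mean()  # fraction of the group violating constraints

    if 0.0 < viol_frac < 1.0:
        # Scale the penalty to the in-group quality spread so that even the
        # best-written violator's mixed reward falls below the group mean,
        # which makes its GRPO advantage negative.
        weight = (s.max() - s.mean() + eps) / (1.0 - viol_frac)
        weight = max(weight, base_weight)
    else:
        # All samples comply (nothing to penalize) or all violate
        # (a uniform shift leaves group-relative advantages unchanged).
        weight = base_weight

    # Subtract the penalty only from constraint-violating samples.
    return s - weight * (~ok).astype(float)
```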
Problem

Research questions and friction points this paper is trying to address.

Balancing subjective writing quality with objective constraints in creative writing
Adapting reward weights dynamically for different writing scenarios
Improving both instruction following and writing quality simultaneously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically mixed reward system balances subjective writing quality against objective constraint following
Constraint reward weight adjusted according to the writing quality of samples within each group
Negative advantage penalizes constraint violations in GRPO training (illustrated in the sketch below)
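To make the third point concrete: GRPO computes each sample's advantage relative to its own sampling group (a z-score within the group), so once the mixed penalty drops a violator's reward below the group mean, its advantage is negative and the policy update suppresses it. The toy scores and the penalty weight below are invented for illustration.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO's group-relative advantage: z-score of each reward in its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy group of four completions. Sample 1 has the best writing score (0.90)
# but violates a constraint; after subtracting an illustrative penalty of 1.2,
# its reward sits below the group mean, so its advantage comes out negative
# while the compliant samples' advantages stay positive.
mixed = [0.62, 0.90 - 1.2, 0.55, 0.70]
print(grpo_advantages(mixed))
```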
Authors
Jianxing Liao
Tencent Hunyuan Team
Tian Zhang
Tencent Hunyuan Team
Xiao Feng
Tencent Hunyuan Team
Yusong Zhang
Tencent Hunyuan Team
Rui Yang
Tencent Hunyuan Team
Haorui Wang
PhD student, Gatech
Machine Learning, Large Language Models, Decision Making, Uncertainty Quantification
Bosi Wen
Tsinghua University
Natural Language Processing
Ziying Wang
Peking University
Runzhi Shi
Peking University