UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

📅 2026-04-07
📈 Citations: 0 · Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of balancing global coherence in long-form narratives with local expressiveness in short-form texts, a trade-off exacerbated by existing alignment methods' reliance on costly high-quality supervised data and static reward signals. The authors propose a reference-free, unified reinforcement learning framework that dynamically generates task-specific evaluation criteria to align directly with human preferences. The approach rests on two innovations: an adaptive constraint-aware generative reward model (AC-GenRM) and an accompanying policy optimization algorithm (ACPO). Together, these enable the model, without any supervised fine-tuning, to autonomously distinguish tasks that require deliberate planning from those amenable to direct generation, an emergent metacognitive capability. Experiments show that AC-GenRM aligns closely with expert human judgments, while ACPO significantly improves generation quality across diverse writing tasks, validating the model's ability to select its own generation strategy.
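The summary describes AC-GenRM only at a high level. As a rough illustration of what "dynamically generating task-specific evaluation criteria" can look like in practice, here is a minimal Python sketch of a two-stage generative judge: derive criteria from the query, then judge a response pair against them. The function names, prompts, and LLM interface are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of criteria-synthesizing pairwise judging, assuming a
# generic text-in/text-out LLM callable. Not the paper's AC-GenRM code.
from typing import Callable, List

def synthesize_criteria(llm: Callable[[str], str], query: str) -> List[str]:
    """Ask the judge model to derive evaluation criteria for THIS query."""
    prompt = (
        "You are evaluating creative writing.\n"
        f"Writing task: {query}\n"
        "List the evaluation criteria (coherence, style, task constraints, ...) "
        "that matter most for this specific task, one per line."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def judge_pair(llm: Callable[[str], str], query: str,
               resp_a: str, resp_b: str) -> int:
    """Return +1 if resp_a is preferred, -1 otherwise (no reference text)."""
    criteria = synthesize_criteria(llm, query)
    prompt = (
        f"Task: {query}\n"
        "Criteria:\n" + "\n".join(f"- {c}" for c in criteria) + "\n\n"
        f"Response A:\n{resp_a}\n\nResponse B:\n{resp_b}\n\n"
        "Judge strictly by the criteria above. Answer 'A' or 'B'."
    )
    return 1 if llm(prompt).strip().upper().startswith("A") else -1
```

A relative preference signal of this form is all a reference-free policy-optimization loop needs; a sketch of such a loop follows the abstract below.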
📝 Abstract
A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose UniCreative, a unified reference-free reinforcement learning framework. We first introduce AC-GenRM, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose ACPO, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning or ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.
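To make the "without supervised fine-tuning or ground-truth references" claim concrete, below is a minimal sketch of one reference-free RL update driven by judge scores of the kind sketched above. The page does not spell out ACPO's objective, so this uses a generic GRPO-style group-relative baseline as a stand-in; `policy.sample`, `policy.log_prob`, and `reward_fn` are assumed interfaces, not the paper's API.

```python
# Minimal sketch of a reference-free policy update (GRPO-style stand-in for
# ACPO). Assumptions: policy.sample(query) returns a text rollout;
# policy.log_prob(query, text) returns a differentiable scalar log-prob;
# reward_fn(query, text) wraps an AC-GenRM-like scalar judge score.
import torch

def reference_free_step(policy, optimizer, query, reward_fn, group_size=4):
    rollouts = [policy.sample(query) for _ in range(group_size)]
    rewards = torch.tensor(
        [reward_fn(query, r) for r in rollouts], dtype=torch.float32
    )
    # Group-relative advantages: the group mean replaces any reference answer
    # or learned value baseline, so no ground truth is ever consulted.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    loss = sum(
        -a * policy.log_prob(query, r) for a, r in zip(adv, rollouts)
    ) / group_size
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The property mirrored here is that the only learning signal is the judge's relative scoring within a sampled group, which is what lets such a framework skip both SFT warm-up and reference texts.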
Problem

Research questions and friction points this paper is trying to address.

creative writing
global coherence
local expressiveness
reinforcement learning
reference-free alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-free reinforcement learning
adaptive reward modeling
unified creative generation
policy optimization
meta-cognitive ability
👥 Authors
Xiaolong Wei (Beihang University)
Zerun Zhu (Baidu Inc.)
Simin Niu (Renmin University of China)
Xingyu Zhang (Horizon Robotics Inc.; NLP & VLM & AD)
Peiying Yu (Soochow University)
Changxuan Xiao (Baidu Inc.)
Yuchen Li (Baidu Inc.)
Jicheng Yang (Baidu Inc.)
Zhejun Zhao (Baidu Inc.)
Chong Meng (Baidu Inc.)
Long Xia (Research Scientist, Baidu; information retrieval, data mining, applied machine learning, recommender systems)
Daiting Shi (Baidu Inc.)