🤖 AI Summary
This work addresses the challenge of balancing global coherence in long-form narratives against local expressiveness in short texts. Existing alignment approaches handle this trade-off poorly because they depend on static reward signals and costly, hard-to-scale supervised data. The authors propose a reference-free, unified reinforcement learning framework that dynamically generates task-specific evaluation criteria to align directly with human preferences. Central to this approach are two innovations: an adaptive constraint-aware reward model (AC-GenRM) and an accompanying policy optimization algorithm (ACPO). Together, and without any supervised fine-tuning, they enable the model to autonomously distinguish tasks that require deliberate planning from those amenable to direct generation, an emergent metacognitive capability. Experimental results show strong agreement between AC-GenRM and expert human judgments, while ACPO significantly improves generation quality across diverse writing tasks, validating the model's ability to choose its own generation strategy.
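The dynamic criteria generation at the heart of AC-GenRM can be pictured as a two-stage prompt-then-judge loop: first derive query-specific criteria, then score a candidate against each one. The sketch below is a minimal illustration of that idea, not the paper's implementation; `call_llm`, the prompts, and the 1-to-5 scale are all hypothetical stand-ins.

```python
# Hypothetical sketch of the AC-GenRM idea: synthesize query-specific
# criteria, then judge a candidate against them. `call_llm` is a stand-in
# for any chat-completion backend; all prompts and names are illustrative.
from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; wire this to your own backend."""
    raise NotImplementedError


def synthesize_criteria(query: str, n_criteria: int = 5) -> List[str]:
    """Ask the judge model to derive evaluation criteria for this query."""
    prompt = (
        f"Writing task:\n{query}\n\n"
        f"List {n_criteria} concrete criteria a strong response must satisfy, "
        "one per line."
    )
    return [c.strip() for c in call_llm(prompt).splitlines() if c.strip()]


def score_response(query: str, response: str, criteria: List[str]) -> float:
    """Judge the response against each synthesized criterion; return the mean."""
    scores = []
    for criterion in criteria:
        prompt = (
            f"Task: {query}\nResponse: {response}\nCriterion: {criterion}\n"
            "Rate how well the response satisfies this criterion from "
            "1 (poor) to 5 (excellent). Answer with the number only."
        )
        scores.append(float(call_llm(prompt)))
    return sum(scores) / len(scores)
```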
📝 Abstract
A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose **UniCreative**, a unified reference-free reinforcement learning framework. We first introduce **AC-GenRM**, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose **ACPO**, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms, without supervised fine-tuning or ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent metacognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.
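The abstract describes ACPO as reference-free: the training signal comes from AC-GenRM scores rather than gold answers. One common way to turn such scores into a policy-gradient signal is group-relative normalization over several samples for the same prompt. The paper does not spell out its objective here, so the GRPO-style sketch below is only an assumption about how the scores might be consumed.

```python
# Illustrative sketch of a reference-free policy-gradient signal built on
# per-sample reward-model scores. The group-relative normalization below is
# an assumption (GRPO-style); the actual ACPO objective may differ.
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within a group of samples drawn for one prompt,
    so no ground-truth reference is needed: each sample is scored against
    its siblings rather than against a gold answer."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


# Usage: score G sampled completions with AC-GenRM, then weight each
# sample's log-likelihood gradient by its normalized advantage.
advantages = group_relative_advantages([3.2, 4.6, 2.8, 4.0])
print(advantages)  # higher-scored samples receive positive weight
```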