ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face dual limitations in high-quality long-form generation: heavy reliance on scarce human-annotated data and optimization over only coarse-grained quality dimensions (e.g., relevance, coherence). Method: The paper proposes the Adaptive Constraint-Enhanced reward framework for long-form generation Reinforcement Learning (ACE-RL), which dynamically constructs and verifies fine-grained, interpretable constraints from each instruction's underlying intents and demands, transforming subjective quality assessment into a quantifiable constraint-verification problem. ACE-RL combines intent-aware constraint decomposition, constraint-satisfaction reward modeling, and reinforcement learning, sharply reducing dependence on paired preference data or supervised fine-tuning (SFT) annotations. Results: On WritingBench, ACE-RL outperforms SFT and RL baselines by 20.70% and 7.32%, respectively, and its top-performing model surpasses GPT-4o by 7.10%, demonstrating strong generalization and controllability across diverse long-form generation scenarios.
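
The constraint-construction step can be pictured concretely. Below is a minimal Python sketch of instruction-to-constraint decomposition, assuming an LLM backend behind a placeholder call_llm callable; the prompt wording and the Constraint fields are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of ACE-RL-style adaptive constraint decomposition.
# `call_llm` is a placeholder for any chat-completion backend; the prompt
# text and Constraint fields are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, List
import json

@dataclass
class Constraint:
    description: str   # a fine-grained, checkable requirement, e.g. "uses a formal tone"
    dimension: str     # the quality dimension it instantiates, e.g. "style", "structure"

DECOMPOSE_PROMPT = """Identify the underlying intents of the writing instruction
below, then list fine-grained, verifiable constraints a high-quality response
must satisfy. Return JSON: [{{"description": ..., "dimension": ...}}, ...].

Instruction:
{instruction}"""

def decompose_instruction(instruction: str,
                          call_llm: Callable[[str], str]) -> List[Constraint]:
    """Turn one instruction into an adaptive set of constraint criteria."""
    raw = call_llm(DECOMPOSE_PROMPT.format(instruction=instruction))
    return [Constraint(**item) for item in json.loads(raw)]
```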

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable progress in long-context understanding, yet they face significant challenges in high-quality long-form generation. Existing studies primarily suffer from two limitations: (1) A heavy reliance on scarce, high-quality long-form response data for supervised fine-tuning (SFT) or for pairwise preference rewards in reinforcement learning (RL). (2) A focus on coarse-grained quality optimization dimensions, such as relevance, coherence, and helpfulness, overlooking the fine-grained specifics inherent to diverse long-form generation scenarios. To address these issues, we propose a framework using Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first automatically deconstructs each instruction into a set of fine-grained, adaptive constraint criteria by identifying its underlying intents and demands. Subsequently, we design a reward mechanism that quantifies the quality of long-form responses based on their satisfaction of the corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we utilize reinforcement learning to guide models toward superior long-form generation capabilities. Experimental results demonstrate that our ACE-RL framework significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 7.10%, providing a more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios.
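
To make the "constraint verification" idea concrete, here is a hedged sketch of a reward that scores a response by the fraction of constraints a verifier judges satisfied. The call_llm placeholder and the YES/NO verification prompt are assumptions; the paper may weight or aggregate constraints differently.

```python
# Hedged sketch of the constraint-verification reward: subjective quality
# becomes the share of fine-grained constraints a response satisfies.
from typing import Callable, List

VERIFY_PROMPT = """Does the response satisfy this constraint? Answer YES or NO.

Constraint: {constraint}

Response:
{response}"""

def constraint_reward(response: str,
                      constraints: List[str],
                      call_llm: Callable[[str], str]) -> float:
    """Reward in [0, 1]: fraction of constraints the verifier judges satisfied."""
    if not constraints:
        return 0.0
    hits = 0
    for c in constraints:
        verdict = call_llm(VERIFY_PROMPT.format(constraint=c, response=response))
        hits += verdict.strip().upper().startswith("YES")
    return hits / len(constraints)
```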
Problem

Research questions and friction points this paper is trying to address.

Reduces reliance on scarce, high-quality annotated data for long-form generation
Moves beyond coarse-grained quality dimensions that overlook scenario-specific requirements
Seeks to convert subjective quality evaluation into verifiable constraint checks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive deconstruction of instructions into fine-grained constraint criteria
Constraint-satisfaction reward mechanism that quantifies long-form response quality
Reinforcement learning guided by constraint rewards (see the sketch below)
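
As referenced above, a minimal sketch of the RL step follows. The summary does not name the specific algorithm, so a GRPO-style group-normalized advantage over multiple sampled responses to the same instruction is assumed here purely for illustration.

```python
# Minimal sketch of plugging the constraint reward into RL. A GRPO-style
# group-normalized advantage is an assumption, not the paper's stated choice.
from statistics import mean, pstdev
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize constraint rewards within a group of sampled responses to one
    instruction; each advantage then weights that response's log-probability
    gradient in the policy update."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. four responses to one instruction, scored by constraint_reward above:
print(group_advantages([0.25, 0.75, 0.50, 1.00]))
```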