🤖 AI Summary
This work addresses a limitation of static constraints in reinforcement learning fine-tuning: while they prevent degenerate outputs, they also suppress a model’s ability to explore superior solutions. To overcome this trade-off, the authors propose a dynamic constraint mechanism that employs a reference model as an online corrector, applying minimal intervention only when degenerate outputs are detected. This correction is combined with a supervised fine-tuning loss that guides the model toward high-quality responses, so the constraint strength adaptively scales with output quality. Evaluated on dialogue and code generation tasks, the method significantly outperforms both KL-regularized and unconstrained baselines, achieving higher task rewards without compromising training stability—thus effectively balancing exploration capability with constraint efficacy.
📝 Abstract
Constraints are essential for stabilizing reinforcement learning fine-tuning (RFT) and preventing degenerate outputs, yet they inherently conflict with the optimization objective because stronger constraints limit the ability of a fine-tuned model to discover better solutions. We propose \textit{dynamic constraints} that resolve this tension by adapting to the evolving capabilities of the fine-tuned model, based on the insight that constraints should only intervene when degenerate outputs occur. We implement this by using a reference model as an \textit{online refiner} that takes the response from the fine-tuned model and generates a minimally corrected version which preserves correct content verbatim while fixing errors. A supervised fine-tuning loss then trains the fine-tuned model to produce the refined output. This mechanism yields a constraint that automatically strengthens or relaxes based on output quality. Experiments on dialogue and code generation show that dynamic constraints outperform both KL regularization and unconstrained baselines, achieving substantially higher task rewards while maintaining training stability.
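The control flow of the dynamic-constraint mechanism described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the names `is_degenerate`, `refine`, and `policy_nll`, and the repetition-based degeneracy check, are all hypothetical stand-ins for the paper's reference-model refiner and SFT loss.

```python
def is_degenerate(response: str) -> bool:
    # Hypothetical degeneracy check: flag empty or highly repetitive
    # outputs (less than half of the tokens are distinct).
    tokens = response.split()
    return len(tokens) == 0 or len(set(tokens)) / len(tokens) < 0.5

def refine(reference_model, response: str) -> str:
    # The reference model acts as an online refiner: ideally it preserves
    # correct content verbatim and minimally corrects errors. Here it is
    # just a callable standing in for that model.
    return reference_model(response)

def dynamic_constraint_loss(policy_nll, response: str, reference_model) -> float:
    """Return the SFT penalty pulling the policy toward the refined output.

    The constraint is adaptive: if the response needs no correction, the
    refined target equals the response and the penalty vanishes; the worse
    the output, the larger the correction and the stronger the pull.
    """
    refined = refine(reference_model, response)
    if refined == response:
        return 0.0  # constraint fully relaxed: output already acceptable
    # Supervised fine-tuning loss on the refined target; policy_nll is a
    # stand-in for -log p_policy(refined).
    return policy_nll(refined)
```

For example, with a toy refiner that replaces degenerate responses and a toy negative log-likelihood, a clean response incurs zero penalty while a repetitive one is pulled toward the corrected target; a real implementation would compute `policy_nll` as the token-level cross-entropy of the fine-tuned model on the refined sequence and add this term to the RL objective.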