🤖 AI Summary
This work investigates fine-tuning large language models (LLMs) with reinforcement learning (RL) as training moves from offline through semi-online to fully online regimes, covering both verifiable tasks (e.g., mathematical reasoning) and non-verifiable tasks (e.g., instruction following). The authors study a unified setup spanning DPO and GRPO objectives, semi-online and fully online sampling, reward modeling for non-verifiable tasks, and joint multi-task optimization with verifiable and non-verifiable rewards. Key contributions include: (i) a systematic empirical finding that semi-online and fully online DPO/GRPO achieve comparable performance, with both significantly surpassing purely offline RL baselines; and (ii) a demonstration that joint multi-task training improves performance on both task types. On mathematical reasoning and instruction-following benchmarks, the online and semi-online variants outperform offline RL approaches, with faster convergence and more stable training. The work also provides practical guidelines for analyzing training dynamics and selecting hyperparameters.
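As background on the GRPO objective mentioned above: GRPO dispenses with a learned value baseline and instead computes group-relative advantages, normalizing each sampled completion's reward against the mean and standard deviation of its sampling group. A minimal sketch of that normalization step (the function name and epsilon constant are illustrative assumptions, not the paper's implementation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style training:
    each completion's reward is standardized against its own group,
    so credit is assigned relative to sibling samples for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # Small epsilon guards against zero variance (all rewards identical).
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

With a binary verifiable reward (e.g., a math answer checked as right or wrong), correct completions in a mixed group receive positive advantage and incorrect ones negative, while a group that is all-correct or all-wrong yields near-zero advantages and contributes little gradient.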
📝 Abstract
We investigate the effectiveness of reinforcement learning methods for fine-tuning large language models when transitioning from offline to semi-online to fully online regimes, for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following, with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Relative Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, all of which strongly outperform offline methods. We provide a detailed analysis of the training dynamics and of hyperparameter selection strategies for achieving optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
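For readers unfamiliar with the other objective being compared, the standard DPO loss scores a chosen/rejected response pair by the policy's log-probability margin over a frozen reference model; in the semi-online and online variants studied here, the pairs are drawn from recent policy samples rather than a fixed offline dataset. A minimal sketch of the per-pair loss (function name and scalar sequence log-probability inputs are illustrative, not the paper's code):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are total sequence log-probabilities under the current policy
    and the frozen reference model; beta scales the implicit KL penalty.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference on both responses the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.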