🤖 AI Summary
This work investigates fine-tuning large language models (LLMs) with reinforcement learning (RL) as training moves from offline through semi-online to fully online regimes, covering both verifiable tasks (e.g., mathematical reasoning) and non-verifiable tasks (e.g., instruction following). The authors study a unified setup spanning DPO and GRPO objectives, semi-online and fully online sampling, reward modeling for non-verifiable tasks, and joint multi-task optimization with verifiable and non-verifiable rewards. Key contributions include: (i) a systematic empirical finding that semi-online and fully online DPO/GRPO achieve comparable performance, with both significantly surpassing purely offline RL baselines; and (ii) a demonstration that joint multi-task training improves performance on both task types. On mathematical reasoning and instruction-following benchmarks, the online and semi-online variants outperform offline RL approaches, with faster convergence and more stable training. The work also provides practical guidelines for analyzing training dynamics and selecting hyperparameters.
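As background on the GRPO objective mentioned above: GRPO dispenses with a learned value baseline and instead computes group-relative advantages, normalizing each sampled completion's reward against the mean and standard deviation of its sampling group. A minimal sketch of that normalization step (the function name and epsilon constant are illustrative assumptions, not the paper's implementation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style training:
    each completion's reward is standardized against its own group,
    so credit is assigned relative to sibling samples for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # Small epsilon guards against zero variance (all rewards identical).
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

With a binary verifiable reward (e.g., a math answer checked as right or wrong), correct completions in a mixed group receive positive advantage and incorrect ones negative, while a group that is all-correct or all-wrong yields near-zero advantages and contributes little gradient.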
📝 Abstract
We investigate the effectiveness of reinforcement learning methods for fine-tuning large language models when transitioning from offline to semi-online to fully online regimes, for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following, with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Relative Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, all of which strongly outperform offline methods. We provide a detailed analysis of the training dynamics and of hyperparameter selection strategies for achieving optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
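For readers unfamiliar with the other objective being compared, the standard DPO loss scores a chosen/rejected response pair by the policy's log-probability margin over a frozen reference model; in the semi-online and online variants studied here, the pairs are drawn from recent policy samples rather than a fixed offline dataset. A minimal sketch of the per-pair loss (function name and scalar sequence log-probability inputs are illustrative, not the paper's code):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are total sequence log-probabilities under the current policy
    and the frozen reference model; beta scales the implicit KL penalty.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference on both responses the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.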