๐ค AI Summary
This work addresses the challenge of jointly optimizing imitation learning (IL) and reinforcement learning (RL) in online fine-tuning of large language models (LLMs). We propose a unified framework that enforces trajectory-level KL divergence constraints to preserve imitation fidelity while leveraging task rewards for long-horizon optimization, enabled by gradient decoupling. Our key contribution is the first derivation of a closed-form token-level IL gradient in logit space, decomposing the composite objective into analytically computable dense gradients (for token-level IL) and sparse Monte Carloโestimated gradients (for reward-driven RL), enabling GPU-native efficient online hybrid updates. Experiments on multi-task instruction tuning show that our method reduces policy variance by 30% compared to pure RLHF, significantly improving training stability and sample efficiency, while maintaining high-fidelity behavioral imitation.
๐ Abstract
We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.