One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Supervised fine-tuning (SFT) generalizes worse than reinforcement learning (RL) not merely because of differences in loss functions, but primarily because SFT relies on static offline data while RL leverages online data generated by the current policy. Method: We propose One-Token Rollout (OTR), the first method to model each token generation in SFT as a single-step on-policy RL trajectory. At each position, OTR performs a Monte Carlo rollout by sampling candidate tokens from the current model's distribution and uses policy gradients to construct token-level reward signals dynamically, without full-sequence sampling, enabling online-style learning from offline data. Contribution/Results: By reusing offline datasets to generate policy-adaptive learning targets, OTR significantly improves generalization. It consistently outperforms standard SFT across challenging benchmarks spanning mathematical reasoning, code generation, and general-purpose reasoning, demonstrating both effectiveness and broad applicability.

📝 Abstract
Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo "rollout" by sampling multiple candidate tokens from the current policy's distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for fine-tuning LLMs.
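The abstract's per-token recipe (sample candidate tokens from the current policy, reward them against the ground-truth token, update via policy gradient) can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation; the number of samples `k`, the 0/1 reward, and the function name `otr_loss` are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def otr_loss(logits, target_ids, k=4):
    """Hedged sketch of a one-token-rollout style loss.

    logits:     (seq_len, vocab) next-token logits from the current policy
    target_ids: (seq_len,) ground-truth tokens from the supervised data
    """
    log_probs = F.log_softmax(logits, dim=-1)                # (seq, vocab)
    probs = log_probs.exp()
    # One-token "rollout": sample k candidate tokens per position
    # from the current policy's distribution (on-policy samples).
    candidates = torch.multinomial(probs, num_samples=k, replacement=True)  # (seq, k)
    # Reward signal from the supervised data: 1 if a sampled token
    # matches the ground-truth token, else 0 (an assumed reward shape).
    rewards = (candidates == target_ids.unsqueeze(-1)).float()              # (seq, k)
    cand_log_probs = log_probs.gather(1, candidates)                        # (seq, k)
    # REINFORCE-style single-step objective: maximize reward-weighted
    # log-probabilities of the sampled candidates.
    return -(rewards * cand_log_probs).mean()

# Toy usage with random logits; no real model is needed for the sketch.
logits = torch.randn(5, 100)
targets = torch.randint(0, 100, (5,))
loss = otr_loss(logits, targets)
```

Because each "trajectory" is a single sampled token, the update stays as cheap as a standard SFT step while the gradient flows only through tokens the current policy actually sampled, which is the on-policy property the paper argues drives generalization.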
Problem

Research questions and friction points this paper is trying to address.

Improving generalization of supervised fine-tuning for LLMs
Integrating policy gradient methods into SFT training process
Converting static supervised data into dynamic on-policy signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-token rollout algorithm guides SFT with policy gradient
Treats token generation as single-step reinforcement learning trajectory
Repurposes static supervised data into dynamic on-policy signal