Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Small-scale open-source LLMs underperform on multi-step reasoning tasks: supervised fine-tuning (SFT) overfits lengthy demonstrations through rigid token-level imitation, while reinforcement learning with verifiable rewards (RLVR) struggles to converge when correct solutions are sampled only rarely. To address this, the paper proposes Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of discrete, interpretable logical "actions". Step-level, fine-grained supervision signals are constructed from expert trajectories: the model is trained to produce an internal reasoning monologue before committing to each action, and a smooth reward function based on similarity to expert actions provides positive feedback even for partially correct reasoning. The framework combines SFT's stability with RLVR's exploratory capability, substantially improving small-model performance on complex reasoning benchmarks and outperforming both pure SFT and pure RLVR baselines. It further generalizes to agentic software engineering tasks, demonstrating broad applicability and effectiveness.

📝 Abstract
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
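The step-wise similarity reward described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes actions are plain strings and uses `difflib.SequenceMatcher` as the similarity metric, with a simple average over expert steps as the aggregation.

```python
from difflib import SequenceMatcher

def step_wise_reward(model_actions, expert_actions):
    """Smooth reward: average per-step string similarity between the
    model's actions and the expert's actions. Steps the model did not
    produce score zero, so partially correct rollouts still earn a
    positive signal even when no rollout is fully correct.
    Sketch only: the paper's exact metric and aggregation may differ."""
    if not expert_actions:
        return 0.0
    total = 0.0
    for i, expert in enumerate(expert_actions):
        model = model_actions[i] if i < len(model_actions) else ""
        total += SequenceMatcher(None, model, expert).ratio()
    return total / len(expert_actions)

# A rollout that matches the first step exactly and the second step
# approximately gets a reward strictly between 0 and 1:
r = step_wise_reward(["x = 3", "y = x + 1"],
                     ["x = 3", "y = x + 2", "return y"])
```

Unlike a binary verifiable reward, this signal is dense: it distinguishes a rollout that reproduces most expert steps from one that reproduces none, which is the property SRL relies on when correct full solutions are rarely sampled.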
Problem

Research questions and friction points this paper is trying to address.

Addressing multi-step reasoning failures in small LLMs
Overcoming rigid imitation in supervised fine-tuning
Providing step-wise rewards when correct solutions are rare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses step-wise action generation for reasoning
Provides smooth rewards from expert action similarity
Combines supervised pre-training with reinforcement learning