AI Summary
Standard supervised fine-tuning (SFT) in large language model (LLM) knowledge distillation enforces sequential output generation, causing structural collapse: the loss of the teacher's implicit multi-branch reasoning topology, characterized by alternating meta-reasoning and solution-generation phases. Method: We propose RLKD, a reinforcement learning (RL)-based reasoning distillation framework. Its core innovation is to formalize the cognitive-neuroscience-inspired dual-phase meta-reasoning-solving mechanism as an optimizable implicit multi-branch structure, coupled with a Generative Structure Reward Model (GSRM) that assesses structural alignment at the path level. Contribution/Results: RLKD overcomes the structural-collapse bottleneck inherent in SFT. With only 0.1% of the training data, it significantly outperforms standard SFT-RL pipelines, achieving an average 12.7% improvement on mathematical and symbolic reasoning benchmarks, and it particularly enhances students' logical decomposition, sub-problem selection, and cross-step reasoning capabilities.
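To make the GSRM idea concrete, here is a minimal, hypothetical sketch of a path-level structural reward. The `Step`, `decompose`, and `structural_reward` names are illustrative assumptions: in the paper the decomposition into meta-reasoning-solving steps is performed by a trained generative reward model, not by the naive paragraph splitter and exact-match scoring used below.

```python
from dataclasses import dataclass


@dataclass
class Step:
    meta: str      # meta-reasoning phase: which sub-problem is selected next
    solution: str  # solving phase: the work done on that sub-problem


def decompose(path: str) -> list[Step]:
    """Toy stand-in for GSRM's generative decomposition of a reasoning path
    into alternating meta-reasoning / solving steps (here: paragraph pairs)."""
    parts = [p.strip() for p in path.split("\n\n") if p.strip()]
    return [Step(meta=parts[i],
                 solution=parts[i + 1] if i + 1 < len(parts) else "")
            for i in range(0, len(parts), 2)]


def structural_reward(student_path: str, teacher_path: str) -> float:
    """Score structural alignment step by step: the fraction of teacher steps
    whose meta-reasoning the student reproduces at the same position. Exact
    string matching is a placeholder for the GSRM's learned comparison."""
    s_steps = decompose(student_path)
    t_steps = decompose(teacher_path)
    matched = sum(s.meta == t.meta for s, t in zip(s_steps, t_steps))
    return matched / max(len(t_steps), 1)
```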
Abstract
Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the chosen sub-problem); this implies that authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into flat sequential prediction of the tokens in the teacher's reasoning path, preventing effective distillation of that structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). The GSRM converts reasoning paths into sequences of meta-reasoning-solving steps and computes rewards that measure the structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show that RLKD surpasses standard SFT-RL pipelines even when trained on only 0.1% of the data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.
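As a rough illustration of the RL-only regime described above, the sketch below wires a GSRM-style structural reward into a bare REINFORCE update. The paper does not prescribe this exact algorithm, reward shaping, or hyperparameters; `policy` and `tokenizer` are assumed to be Hugging Face `transformers` causal-LM objects, and `structural_reward` is the toy function sketched in the summary section.

```python
import torch


def rl_distill_step(policy, tokenizer, optimizer, prompt, teacher_path):
    """One hypothetical RL distillation step: sample a student reasoning path,
    score its structural alignment with the teacher, and apply REINFORCE."""
    enc = tokenizer(prompt, return_tensors="pt")
    out = policy.generate(**enc, do_sample=True, max_new_tokens=512,
                          return_dict_in_generate=True)
    prompt_len = enc["input_ids"].shape[1]
    gen_ids = out.sequences[0, prompt_len:]
    student_path = tokenizer.decode(gen_ids, skip_special_tokens=True)

    # Path-level reward: structural alignment with the teacher's reasoning
    # (a full setup would likely mix in an answer-correctness reward too).
    reward = structural_reward(student_path, teacher_path)

    # generate() runs without gradients, so recompute token log-probs with a
    # forward pass; logits at position t predict the token at position t + 1.
    logits = policy(out.sequences).logits[:, :-1, :]
    logps = logits.log_softmax(-1)[0, prompt_len - 1:, :]
    tok_logps = logps.gather(-1, gen_ids.unsqueeze(-1)).squeeze(-1)

    # REINFORCE without a baseline or KL penalty, for brevity.
    loss = -reward * tok_logps.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

In practice one would more likely use a PPO- or GRPO-style trainer with a baseline and KL regularization toward a reference model; this stripped-down form only shows where a path-level structural reward enters the objective.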