Rethinking Expert Trajectory Utilization in LLM Post-training

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the efficient utilization of expert trajectories in large language model (LLM) post-training, proposing the Plasticity–Ceiling theoretical framework to systematically characterize the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL). Methodologically, it analyzes trajectory selection, scheduling, and scaling through empirical and analytical lenses. Key contributions include: (1) refuting the empirical heuristic that "fewer expert trajectories are always better"; (2) establishing a sequential SFT-then-RL pipeline as the stable optimal paradigm, with a precise switching criterion based on the inflection point of the validation loss; and (3) quantifying the complementary interplay between data scale (which governs latent capacity) and trajectory difficulty (which provides a multiplicative gain), identifying the minimum SFT validation loss as a robust proxy for trajectory quality. The framework yields significant, reproducible performance gains across multiple LLM benchmarks, offering both theoretical foundations and actionable guidelines for post-training data strategy.
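Concretely, the switching criterion can be pictured as follows. This is a minimal sketch, assuming the criterion amounts to handing off to RL once SFT validation loss plateaus or just begins to rise (the stable or mild-overfitting sub-phase around the loss minimum); the function name, thresholds, and loss curve are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the SFT-to-RL switching rule described above.
# We monitor SFT validation loss at each evaluation and switch once the
# loss has stopped improving meaningfully. All names and constants are
# illustrative, not from the paper.

def should_switch_to_rl(val_losses, patience=3, min_delta=1e-3):
    """True once validation loss has failed to beat its earlier best by at
    least `min_delta` for `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])   # best loss before the recent window
    recent = val_losses[-patience:]             # the last `patience` evaluations
    return all(loss > best_before - min_delta for loss in recent)

# Demo on a synthetic curve that bottoms out and then drifts upward.
curve = [2.00, 1.50, 1.20, 1.05, 1.00, 0.99, 0.99, 1.00, 1.01]
for step in range(1, len(curve) + 1):
    if should_switch_to_rl(curve[:step]):
        print(f"hand off to RL after evaluation {step}")  # fires at evaluation 9
        break
```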

📝 Abstract
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
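Point (3) of the abstract suggests a simple selection recipe, sketched below under the assumption that candidate trajectory sets are compared by the lowest validation loss their respective SFT runs attain. The set names and precomputed loss curves are hypothetical stand-ins for real fine-tuning runs.

```python
# Hypothetical illustration of trajectory-set selection via minimum SFT
# validation loss. In practice each curve would come from a short SFT run
# on that candidate set; here we substitute made-up curves.

candidate_curves = {
    "easy_set":  [1.80, 1.30, 1.10, 1.08, 1.09],
    "mixed_set": [1.90, 1.20, 0.95, 0.93, 0.94],
    "hard_set":  [2.10, 1.60, 1.30, 1.25, 1.27],
}

# Prefer the candidate whose SFT run reaches the lowest validation-loss minimum.
best_set = min(candidate_curves, key=lambda k: min(candidate_curves[k]))
print(f"select '{best_set}' (min val loss = {min(candidate_curves[best_set]):.2f})")
```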
Problem

Research questions and friction points this paper is trying to address.

Optimizing expert trajectory use in LLM post-training
Establishing superior SFT-then-RL pipeline over synchronized methods
Deriving scaling guidelines for data, difficulty, and loss indicators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential SFT-then-RL pipeline overcomes stability deficits
Transition to RL at SFT stable phase maximizes final ceiling
Data scale determines potential, trajectory difficulty multiplies performance (toy model sketched below)
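A toy model of that last point, under our own reading of the Plasticity-Ceiling decomposition: a saturating capacity term set by data scale, multiplied by a difficulty-dependent gain. All functional forms and constants are assumptions for illustration, not quantities from the paper.

```python
import math

# Illustrative-only model: the post-training ceiling grows with data scale
# (a saturating term) and is scaled multiplicatively by trajectory
# difficulty. Nothing here is fitted to the paper's results.

def post_training_ceiling(num_trajectories: int, difficulty: float) -> float:
    base = 1.0 - math.exp(-num_trajectories / 5_000.0)  # latent capacity from scale
    gain = 1.0 + 0.5 * difficulty                       # difficulty in [0, 1] as multiplier
    return base * gain

for n in (1_000, 10_000, 100_000):
    for d in (0.2, 0.8):
        print(f"n={n:>6}, difficulty={d}: ceiling ≈ {post_training_ceiling(n, d):.3f}")
```

Under this reading, no amount of difficulty can rescue a ceiling starved of data (the base term stays small), while on a large corpus harder trajectories lift the attainable ceiling further, which is consistent with the "multiplier" framing above.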
👥 Authors

Bowen Ding, Zhejiang University
Yuhan Chen, School of Engineering, Westlake University
Jiayang Lv, School of Engineering, Westlake University
Jiyao Yuan, Huawei Noah's Ark Lab
Qi Zhu, Huawei Noah's Ark Lab
Shuangshuang Tian, School of Engineering, Westlake University
Dantong Zhu, School of Engineering, Westlake University
Futing Wang, Zhejiang University
Heyuan Deng, Huawei Noah's Ark Lab
Fei Mi, Huawei Noah's Ark Lab
Lifeng Shang, Huawei Noah's Ark Lab
Tao Lin, School of Engineering, Westlake University