🤖 AI Summary
This study addresses the efficient use of expert trajectories in large language model (LLM) post-training, proposing the Plasticity-Ceiling theoretical framework to systematically characterize the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL). Methodologically, it analyzes trajectory selection, scheduling, and scaling through both empirical and analytical lenses. Key contributions include: (1) refuting the heuristic that "fewer expert trajectories are always better"; (2) establishing SFT followed by RL (SFT-then-RL) as the stable, optimal paradigm, with a concrete switching criterion based on the shape of the SFT validation-loss curve (transition during the stable or mild-overfitting sub-phase); and (3) quantifying the complementary interplay between data scale (which sets the latent capacity) and trajectory difficulty (which provides a multiplicative gain), identifying the minimum SFT validation loss as a robust proxy for trajectory quality. The framework yields significant, reproducible performance gains across multiple LLM benchmarks, offering both theoretical foundations and actionable guidelines for post-training data strategy.
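As a concrete illustration of the switching criterion, the sketch below monitors the SFT validation-loss curve and flags the stable / mild-overfitting window as the hand-off point to RL. This is a minimal sketch of the idea, not the paper's implementation; the function name, thresholds, and loss history are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): decide when to switch from
# SFT to RL based on the validation-loss curve. Thresholds are hypothetical.

def should_switch_to_rl(val_losses: list[float],
                        plateau_tol: float = 1e-3,
                        max_overfit_rise: float = 0.02) -> bool:
    """Return True once SFT validation loss has stabilized (plateau) or has
    risen only mildly past its minimum -- the 'stable / mild overfitting'
    window identified as the best hand-off point to RL."""
    if len(val_losses) < 3:
        return False  # not enough history to judge the curve
    best = min(val_losses)
    recent_delta = abs(val_losses[-1] - val_losses[-2])
    # Stable sub-phase: the loss has flattened out near its minimum.
    plateaued = recent_delta < plateau_tol
    # Mild-overfitting sub-phase: loss has ticked up, but only slightly.
    mild_overfit = 0 < val_losses[-1] - best <= max_overfit_rise
    return plateaued or mild_overfit


# Example: the curve flattens and then rises slightly past its minimum.
history = [1.20, 0.95, 0.80, 0.74, 0.73, 0.731]
print(should_switch_to_rl(history))  # True
```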
📝 Abstract
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing final performance into the foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) transitioning to RL during the SFT Stable or Mild Overfitting sub-phase maximizes the final ceiling, securing foundational SFT performance without compromising RL plasticity; (2) refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) we identify the Minimum SFT Validation Loss as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
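In notation of our own (the paper's exact formulation may differ), the Plasticity-Ceiling decomposition can be sketched as follows, where $P_{\mathrm{SFT}}$, $N$, $d$, and $\Delta_{\mathrm{RL}}$ are assumed symbols for the SFT foundation, data scale, trajectory difficulty, and RL gain:

```latex
% Illustrative notation, not the paper's: the final ceiling decomposes into
% the SFT foundation (driven primarily by data scale N) plus the RL
% plasticity gain, with trajectory difficulty d acting as a multiplier.
\[
  P_{\mathrm{final}}
  \;=\;
  \underbrace{P_{\mathrm{SFT}}(N)}_{\text{foundation: data scale}}
  \;+\;
  \underbrace{d \cdot \Delta_{\mathrm{RL}}}_{\text{plasticity: difficulty-scaled RL gain}}
\]
```

Read this way, scaling the data raises the foundation term, while harder trajectories multiply the gain RL can extract on top of it.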