Rethinking Expert Trajectory Utilization in LLM Post-training

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the efficient utilization of expert trajectories in large language model (LLM) post-training, proposing the Plasticity–Ceiling theoretical framework to systematically characterize the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL). Methodologically, it analyzes trajectory selection, scheduling, and scaling through empirical and analytical lenses. Key contributions include: (1) refuting the empirical heuristic that "fewer expert trajectories are always better"; (2) establishing a sequential SFT-then-RL pipeline as the stable optimal paradigm, with a precise switching criterion based on the inflection point of the validation loss; and (3) quantifying the complementary interplay between data scale (which governs latent capacity) and trajectory difficulty (which provides a multiplicative gain), identifying the minimum SFT validation loss as a robust proxy for trajectory quality. The framework yields significant, reproducible performance gains across multiple LLM benchmarks, offering both theoretical foundations and actionable guidelines for post-training data strategy.
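Concretely, the switching criterion can be pictured as follows. This is a minimal sketch, assuming the criterion amounts to handing off to RL once SFT validation loss plateaus or just begins to rise (the stable or mild-overfitting sub-phase around the loss minimum); the function name, thresholds, and loss curve are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the SFT-to-RL switching rule described above.
# We monitor SFT validation loss at each evaluation and switch once the
# loss has stopped improving meaningfully. All names and constants are
# illustrative, not from the paper.

def should_switch_to_rl(val_losses, patience=3, min_delta=1e-3):
    """True once validation loss has failed to beat its earlier best by at
    least `min_delta` for `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])   # best loss before the recent window
    recent = val_losses[-patience:]             # the last `patience` evaluations
    return all(loss > best_before - min_delta for loss in recent)

# Demo on a synthetic curve that bottoms out and then drifts upward.
curve = [2.00, 1.50, 1.20, 1.05, 1.00, 0.99, 0.99, 1.00, 1.01]
for step in range(1, len(curve) + 1):
    if should_switch_to_rl(curve[:step]):
        print(f"hand off to RL after evaluation {step}")  # fires at evaluation 9
        break
```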

📝 Abstract
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
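Point (3) of the abstract suggests a simple selection recipe, sketched below under the assumption that candidate trajectory sets are compared by the lowest validation loss their respective SFT runs attain. The set names and precomputed loss curves are hypothetical stand-ins for real fine-tuning runs.

```python
# Hypothetical illustration of trajectory-set selection via minimum SFT
# validation loss. In practice each curve would come from a short SFT run
# on that candidate set; here we substitute made-up curves.

candidate_curves = {
    "easy_set":  [1.80, 1.30, 1.10, 1.08, 1.09],
    "mixed_set": [1.90, 1.20, 0.95, 0.93, 0.94],
    "hard_set":  [2.10, 1.60, 1.30, 1.25, 1.27],
}

# Prefer the candidate whose SFT run reaches the lowest validation-loss minimum.
best_set = min(candidate_curves, key=lambda k: min(candidate_curves[k]))
print(f"select '{best_set}' (min val loss = {min(candidate_curves[best_set]):.2f})")
```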
Problem

Research questions and friction points this paper is trying to address.

Optimizing expert trajectory use in LLM post-training
Establishing superior SFT-then-RL pipeline over synchronized methods
Deriving scaling guidelines for data, difficulty, and loss indicators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential SFT-then-RL pipeline overcomes stability deficits
Transition to RL at SFT stable phase maximizes final ceiling
Data scale determines potential, trajectory difficulty multiplies performance (toy model sketched below)
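A toy model of that last point, under our own reading of the Plasticity-Ceiling decomposition: a saturating capacity term set by data scale, multiplied by a difficulty-dependent gain. All functional forms and constants are assumptions for illustration, not quantities from the paper.

```python
import math

# Illustrative-only model: the post-training ceiling grows with data scale
# (a saturating term) and is scaled multiplicatively by trajectory
# difficulty. Nothing here is fitted to the paper's results.

def post_training_ceiling(num_trajectories: int, difficulty: float) -> float:
    base = 1.0 - math.exp(-num_trajectories / 5_000.0)  # latent capacity from scale
    gain = 1.0 + 0.5 * difficulty                       # difficulty in [0, 1] as multiplier
    return base * gain

for n in (1_000, 10_000, 100_000):
    for d in (0.2, 0.8):
        print(f"n={n:>6}, difficulty={d}: ceiling ≈ {post_training_ceiling(n, d):.3f}")
```

Under this reading, no amount of difficulty can rescue a ceiling starved of data (the base term stays small), while on a large corpus harder trajectories lift the attainable ceiling further, which is consistent with the "multiplier" framing above.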
👥 Authors

Bowen Ding, Zhejiang University
Yuhan Chen, School of Engineering, Westlake University
Jiayang Lv, School of Engineering, Westlake University
Jiyao Yuan, Huawei Noah's Ark Lab
Qi Zhu, Huawei Noah's Ark Lab
Shuangshuang Tian, School of Engineering, Westlake University
Dantong Zhu, School of Engineering, Westlake University
Futing Wang, Zhejiang University
Heyuan Deng, Huawei Noah's Ark Lab
Fei Mi, Huawei Noah's Ark Lab
Lifeng Shang, Huawei Noah's Ark Lab
Tao Lin, School of Engineering, Westlake University