🤖 AI Summary
Understanding how data characteristics, layer-wise model changes, and training factors jointly influence human value alignment quality during supervised fine-tuning (SFT) of large language models (LLMs) remains an open challenge.
Method: We conducted over 1,000 controlled SFT experiments, integrating layer-wise attribution analysis, cross-task benchmark evaluation, and quantitative modeling of data attributes.
Contribution/Results: We show, for the first time, that weight changes in the middle transformer layers correlate most strongly with alignment improvement; we validate perplexity, rather than surface-level data similarity, as a more stable and reliable predictor of SFT efficacy; and we demonstrate that optimal SFT strategies must be tailored to the model architecture. We publicly release more than 1,000 fine-tuned models together with their evaluation results, distill the core data attributes governing alignment, and introduce the first multidimensional predictive metric suite for SFT effectiveness.
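The layer-wise attribution above can be sketched as measuring, for each layer, how much SFT moved its weights relative to the base model. A minimal illustration (the helper name `layerwise_change`, the toy weights, and the perturbation scales are hypothetical, not the paper's actual setup):

```python
import numpy as np

def layerwise_change(base_weights, tuned_weights):
    """Relative Frobenius norm of the per-layer weight delta:
    ||W_tuned - W_base||_F / ||W_base||_F."""
    return {
        name: np.linalg.norm(tuned_weights[name] - W) / np.linalg.norm(W)
        for name, W in base_weights.items()
    }

rng = np.random.default_rng(0)
base = {f"layer_{i}": rng.normal(size=(8, 8)) for i in range(6)}
# Illustrative only: perturb the middle layers more than the ends,
# mimicking the reported mid-layer concentration of SFT changes.
scales = [0.01, 0.05, 0.20, 0.20, 0.05, 0.01]
tuned = {f"layer_{i}": base[f"layer_{i}"] + rng.normal(scale=s, size=(8, 8))
         for i, s in enumerate(scales)}

changes = layerwise_change(base, tuned)
for name, c in changes.items():
    print(f"{name}: {c:.3f}")
```

In a real experiment, `base_weights` and `tuned_weights` would come from the checkpoints' state dicts, and the resulting per-layer scores would be correlated with benchmark gains.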
📝 Abstract
Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets spanning code generation, mathematical reasoning, and general-domain tasks, producing over 1,000 SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, underscoring the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness, often outperforming superficial similarity between the training data and the benchmark, and that mid-layer weight changes correlate most strongly with performance gains. We will release these 1,000+ SFT models and benchmark results to accelerate further research.
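The perplexity predictor mentioned above is just the exponential of the average per-token negative log-likelihood of the candidate training data under the base model. A minimal sketch (the hand-picked log-probability lists are hypothetical stand-ins for values a real model would produce):

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs of two candidate SFT datasets
# as scored by the base model (illustrative numbers).
familiar_data = [-1.2, -0.8, -1.0, -0.9]   # base model finds this likely
unfamiliar_data = [-3.5, -4.1, -2.9, -3.8]  # base model finds this surprising

ppl_familiar = perplexity(familiar_data)
ppl_unfamiliar = perplexity(unfamiliar_data)
print(f"familiar:   {ppl_familiar:.2f}")
print(f"unfamiliar: {ppl_unfamiliar:.2f}")
```

The abstract's claim is that this single scalar, computed over a candidate dataset before training, tracks downstream SFT gains more reliably than surface-level overlap between the dataset and the benchmark.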