Data Difficulty and the Generalization-Extrapolation Tradeoff in LLM Fine-Tuning

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of a principled criterion for choosing data difficulty in supervised fine-tuning (SFT) of large language models, a gap that has left heuristic selection methods with inconsistent results. Using controlled synthetic experiments and a PAC-Bayesian theoretical analysis, the work examines how data difficulty affects both in-distribution generalization and extrapolation. The central finding is that no single difficulty level is best: for a fixed data budget there is an optimal difficulty, and that optimum shifts toward harder data as the budget grows. The authors explain this through the trade-off between the generalization gap and the extrapolation gap, and offer theoretical grounding and practical guidance for difficulty-aware, data-efficient fine-tuning.
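For orientation on the PAC-Bayesian analysis mentioned in the summary, the block below states one standard bound from that family (Maurer's form of McAllester's bound). It is background only, and not necessarily the bound derived in the paper. The symbols are assumptions introduced here: a loss bounded in [0, 1], an i.i.d. sample of size n ≥ 8, a prior P chosen independently of the sample, a posterior Q, population risk L(Q), empirical risk \hat{L}_n(Q), and confidence parameter δ.

```latex
% With probability at least 1 - \delta over the sample,
% simultaneously for all posteriors Q:
\[
  L(Q) \;\le\; \hat{L}_n(Q)
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\frac{2\sqrt{n}}{\delta}}{2n}}
\]
```

The complexity term shrinks as n grows, which is consistent with the summary's point that a larger data budget can afford harder data, assuming harder data corresponds to a larger KL term.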

📝 Abstract

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.
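The mechanism described in the abstract can be illustrated with a toy model. The sketch below is not the paper's experiment: the functional forms, the constants `alpha` and `beta`, and the function names are all hypothetical, chosen only so that the generalization gap grows with difficulty and shrinks with data size, while the extrapolation gap shrinks as training data approaches the hardest (target) difficulty.

```python
import math


def generalization_gap(difficulty, n, alpha=10.0):
    # Assumed form: harder data is harder to fit from n samples,
    # and the gap decays like 1 / sqrt(n).
    return alpha * difficulty / math.sqrt(n)


def extrapolation_gap(difficulty, beta=1.0):
    # Assumed form: the gap to the hardest test data (difficulty 1.0)
    # grows quadratically as training data gets easier.
    return beta * (1.0 - difficulty) ** 2


def best_difficulty(n, grid_size=101):
    # Grid search over difficulty in [0, 1] for the lowest total gap.
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(
        grid,
        key=lambda d: generalization_gap(d, n) + extrapolation_gap(d),
    )


if __name__ == "__main__":
    for n in (100, 400, 10_000):
        print(f"n={n:>6}  best difficulty={best_difficulty(n):.2f}")
```

Under these assumed forms the minimizer is 1 - alpha / (2 * beta * sqrt(n)), clipped to [0, 1], so the best difficulty rises with the data budget. This matches the qualitative claim in the abstract but says nothing about the actual magnitudes reported in the paper.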

Problem

Research questions and friction points this paper is trying to address.

data difficulty

generalization

extrapolation

fine-tuning

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

data difficulty

generalization-extrapolation tradeoff

supervised fine-tuning