🤖 AI Summary
To address downstream performance degradation caused by domain sampling imbalance in multi-domain training, this paper proposes a data sampling framework that jointly ensures intra-domain consistency and quantifies inter-domain influence. Methodologically: (1) it introduces the first gradient-clustering-based mechanism to enforce intra-domain consistency; (2) it designs a Fisher Information Matrix (FIM)-guided domain influence metric with theoretical interpretability; and (3) it integrates loss trajectory analysis with marginal gain decay modeling to dynamically optimize domain sampling ratios. Technically, the approach leverages a proxy language model, dimensionality reduction for acceleration, and efficient FIM estimation—achieving these improvements without increasing training overhead. Empirical evaluation demonstrates an average 3.4% improvement in downstream task performance across diverse benchmarks.
📝 Abstract
Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.