🤖 AI Summary
This work addresses capability skew in multi-capability language models caused by imbalanced domain data proportions in mixture supervised fine-tuning (SFT) datasets. It proposes IDEAL, a data equilibrium adaptation framework whose core idea is a gradient-based link between domain-wise data volumes and the model's resulting capabilities, which allows the data distribution to be adapted using downstream multi-task performance as feedback. IDEAL comprises three key components: gradient-driven iterative refinement of the training data distribution, domain-level dynamic data volume adjustment, and a joint multi-task evaluation feedback mechanism. In experiments across different capabilities, IDEAL outperforms conventional uniform data allocation, improving multi-task evaluation scores by approximately 7% while yielding more balanced proficiency across domains. The framework offers a gradient-based, optimization-friendly approach to data proportioning for multi-capability alignment.
📝 Abstract
Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the composition of the mixture training dataset, determined by the volume of data from each domain, is a critical factor that directly impacts the final model's performance. While many studies focus on enhancing the quality of training datasets through data selection methods, few works explore the intricate relationship between the compositional quantity of mixture training datasets and the emergent capabilities of LLMs. Given a high-quality multi-domain training dataset, understanding the impact of data from each domain on the model's overall capabilities is crucial for preparing SFT data and training a well-balanced model that performs effectively across diverse domains. In this work, we introduce IDEAL, an innovative data equilibrium adaptation framework designed to effectively optimize the volumes of data from different domains within mixture SFT datasets, thereby enhancing the model's alignment and performance across multiple capabilities. IDEAL employs a gradient-based approach to iteratively refine the training data distribution, dynamically adjusting the volumes of domain-specific data based on their impact on downstream task performance. By leveraging this adaptive mechanism, IDEAL ensures a balanced dataset composition, enabling the model to achieve robust generalization and consistent proficiency across diverse tasks. Experiments across different capabilities demonstrate that IDEAL outperforms conventional uniform data allocation strategies, achieving a comprehensive improvement of approximately 7% in multi-task evaluation scores.
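To make the idea of iteratively adjusting domain volumes concrete, the following is a minimal sketch of a single rebalancing step. It is not IDEAL's actual algorithm, which the abstract does not specify. The function name `update_domain_volumes`, the `step` and `min_volume` parameters, and the assumption that a per-domain estimate of the gradient of the validation loss with respect to that domain's data volume is already available are all hypothetical and introduced only for illustration.

```python
def update_domain_volumes(volumes, val_loss_grads, step=0.15, min_volume=1):
    """One hypothetical rebalancing step over per-domain example counts.

    volumes:        dict mapping domain name -> current number of examples
    val_loss_grads: dict mapping domain name -> assumed estimate of
                    d(validation loss) / d(domain volume); how this estimate
                    is obtained is outside the scope of this sketch
    step:           maximum relative change applied to any domain per step
    """
    # Normalize by the largest magnitude so that no domain changes by more
    # than `step` (as a fraction of its current volume) in one iteration.
    max_abs = max(abs(g) for g in val_loss_grads.values()) or 1.0

    new_volumes = {}
    for domain, n in volumes.items():
        # A negative gradient means more data from this domain is estimated
        # to lower the validation loss, so the domain's volume grows.
        scale = 1.0 - step * val_loss_grads[domain] / max_abs
        new_volumes[domain] = max(min_volume, round(n * scale))
    return new_volumes


if __name__ == "__main__":
    # Toy numbers, chosen only to show the direction of each adjustment.
    volumes = {"math": 1000, "code": 1000, "chat": 1000}
    grads = {"math": -0.2, "code": 0.1, "chat": 0.0}
    print(update_domain_volumes(volumes, grads))
```

In an iterative setting, one would retrain or continue training on the resampled mixture, re-estimate the gradients from downstream evaluation, and repeat. Capping the relative change per step is a design choice made here to keep the toy update stable, not something stated in the source.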