🤖 AI Summary
To address the lack of comprehensive data-quality filtering in large language model pretraining, this paper proposes FIRE, a flexible and scalable multi-signal fusion framework for data quality assessment. The method introduces two mechanisms: (i) "multi-signal alignment," which maps heterogeneous quality scores into a unified space so that diverse signals are directly comparable; and (ii) "orthogonality-aware progressive selection," which preserves signal complementarity via orthogonality constraints and combines multiple raters through weighted ensemble scoring for robustness. Designed for computational efficiency, the framework improves both the comprehensiveness of quality evaluation and the adaptability of data selection. Experiments on the SlimPajama dataset demonstrate an average 2.9% improvement in downstream task performance and over a 50% reduction in the FLOPs required to reach comparable performance, supporting the method's efficiency and generality.
📝 Abstract
Selecting high-quality data can significantly improve the pre-training efficiency of large language models (LLMs). Existing methods often rely on heuristic techniques and a single quality signal, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, enabling a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space and integrates diverse raters to provide a comprehensive quality signal for each data point. We further introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points, balancing computational cost against the refinement of signal orthogonality. Experiments on the SlimPajama dataset show that FIRE consistently outperforms other selection methods and significantly enhances the pre-trained model across a wide range of downstream tasks, delivering a 2.9% average performance boost while more than halving the FLOPs needed to reach a given performance level.
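The pipeline described above (align heterogeneous rater scores, weight raters by complementarity, then progressively shrink the candidate pool) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes rank normalization as a stand-in for FIRE's learned alignment, and uses inverse pairwise correlation as a simple proxy for the orthogonality-aware weighting; all function names are hypothetical.

```python
import numpy as np

def align_scores(scores):
    """Map each rater's raw scores (rows) to [0, 1] via rank
    normalization, so heterogeneous scales become comparable.
    A stand-in for FIRE's alignment into a unified space."""
    ranks = scores.argsort(axis=1).argsort(axis=1)
    return ranks / (scores.shape[1] - 1)

def complementarity_weights(aligned):
    """Weight each rater by how uncorrelated it is with the others:
    raters carrying redundant information get smaller weights
    (a crude proxy for the orthogonality constraint)."""
    corr = np.corrcoef(aligned)                      # raters x raters
    redundancy = (np.abs(corr).sum(axis=1) - 1) / (len(corr) - 1)
    w = 1.0 - redundancy
    return w / w.sum()

def progressive_select(aligned, weights, keep_frac=0.5, rounds=3):
    """Iteratively refine the pool: each round, keep the top
    keep_frac of remaining points by weighted ensemble score."""
    idx = np.arange(aligned.shape[1])
    for _ in range(rounds):
        fused = weights @ aligned[:, idx]            # ensemble score
        k = max(1, int(len(idx) * keep_frac))
        idx = idx[np.argsort(fused)[::-1][:k]]       # top-k survive
    return idx

# Toy usage: 3 raters on wildly different scales, 100 documents.
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 100)) * np.array([[1.0], [100.0], [0.01]])
aligned = align_scores(scores)
weights = complementarity_weights(aligned)
selected = progressive_select(aligned, weights)
```

The per-round shrinking is what trades computational cost against refinement: expensive re-scoring or re-weighting only ever touches the surviving fraction of the pool rather than the full dataset.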