🤖 AI Summary
To address the lack of comprehensive data-quality filtering in large language model pretraining, this paper proposes FIRE, a flexible and scalable multi-signal fusion framework for data quality assessment. The method introduces two mechanisms: (i) "multi-signal alignment," which maps heterogeneous quality scores into a unified space so that diverse signals are directly comparable; and (ii) "orthogonality-aware progressive selection," which preserves signal complementarity via orthogonality constraints and combines multiple raters through weighted ensemble scoring for robustness. Designed for computational efficiency, the framework improves both the comprehensiveness of quality evaluation and the adaptability of data selection. Experiments on the SlimPajama dataset demonstrate an average 2.9% improvement in downstream task performance and over a 50% reduction in the FLOPs required to reach comparable performance, supporting the method's efficiency and generality.
📝 Abstract
Selecting high-quality data can significantly improve the pre-training efficiency of large language models (LLMs). Existing methods often rely on heuristic techniques and a single quality signal, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, enabling a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space and integrates diverse raters to provide a comprehensive quality signal for each data point. We further introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points, balancing computational cost against the refinement of signal orthogonality. Experiments on the SlimPajama dataset show that FIRE consistently outperforms other selection methods and significantly enhances the pre-trained model across a wide range of downstream tasks, delivering a 2.9% average performance boost while more than halving the FLOPs needed to reach a given performance level.
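The pipeline described above (align heterogeneous rater scores, weight raters by complementarity, then progressively shrink the candidate pool) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes rank normalization as a stand-in for FIRE's learned alignment, and uses inverse pairwise correlation as a simple proxy for the orthogonality-aware weighting; all function names are hypothetical.

```python
import numpy as np

def align_scores(scores):
    """Map each rater's raw scores (rows) to [0, 1] via rank
    normalization, so heterogeneous scales become comparable.
    A stand-in for FIRE's alignment into a unified space."""
    ranks = scores.argsort(axis=1).argsort(axis=1)
    return ranks / (scores.shape[1] - 1)

def complementarity_weights(aligned):
    """Weight each rater by how uncorrelated it is with the others:
    raters carrying redundant information get smaller weights
    (a crude proxy for the orthogonality constraint)."""
    corr = np.corrcoef(aligned)                      # raters x raters
    redundancy = (np.abs(corr).sum(axis=1) - 1) / (len(corr) - 1)
    w = 1.0 - redundancy
    return w / w.sum()

def progressive_select(aligned, weights, keep_frac=0.5, rounds=3):
    """Iteratively refine the pool: each round, keep the top
    keep_frac of remaining points by weighted ensemble score."""
    idx = np.arange(aligned.shape[1])
    for _ in range(rounds):
        fused = weights @ aligned[:, idx]            # ensemble score
        k = max(1, int(len(idx) * keep_frac))
        idx = idx[np.argsort(fused)[::-1][:k]]       # top-k survive
    return idx

# Toy usage: 3 raters on wildly different scales, 100 documents.
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 100)) * np.array([[1.0], [100.0], [0.01]])
aligned = align_scores(scores)
weights = complementarity_weights(aligned)
selected = progressive_select(aligned, weights)
```

The per-round shrinking is what trades computational cost against refinement: expensive re-scoring or re-weighting only ever touches the surviving fraction of the pool rather than the full dataset.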