🤖 AI Summary
The escalating scale of training data and the absence of universal data selection methods have led to prohibitively high costs in AI model training.
Method: This paper proposes the General Information Metrics Evaluation (GIME) framework, grounded in Objective Information Theory (OIT). GIME introduces a systematic set of 11 universal information metrics (volume, delay, granularity, variety, and others) to establish a cross-task, transferable paradigm for quantifying data value. It integrates multidimensional information modeling with data-value-driven active sampling.
Results: Extensive experiments across domains, including CTR prediction, civil case prediction, and weather forecasting, demonstrate that GIME significantly reduces training overhead while preserving model performance. In the Judicial AI Program, total training cost decreased by 39.56%. GIME thus provides a principled, scalable foundation for efficient, value-aware data curation in large-scale AI development.
📝 Abstract
To address the growing size of AI model training data and the lack of a universal data selection methodology, factors that significantly drive up training costs, this paper presents the General Information Metrics Evaluation (GIME) method. GIME leverages general information metrics from Objective Information Theory (OIT), including volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, coverage, distortion, and mismatch, to optimize dataset selection for training. Comprehensive experiments conducted across diverse domains, such as CTR prediction, civil case prediction, and weather forecasting, demonstrate that GIME effectively preserves model performance while substantially reducing both training time and cost. Additionally, applying GIME within the Judicial AI Program led to a 39.56% reduction in total model training expenses, underscoring its potential to support efficient and sustainable AI development.
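To make the metric-driven selection idea concrete, the sketch below scores candidate datasets with stand-ins for a few of the 11 metrics named above (volume, sampling rate, variety, coverage) and picks the highest-scoring one. The actual OIT-based metric definitions are given in the paper; the formulas, normalization constants, and weights here are hypothetical illustrations, not the authors' method.

```python
# Hypothetical sketch of data-value-driven selection: each candidate
# dataset is scored on simplified proxies for a few GIME metrics.
# All formulas and thresholds below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DatasetStats:
    num_records: int         # raw size -> proxy for "volume"
    records_per_hour: float  # acquisition frequency -> proxy for "sampling rate"
    classes_seen: int        # distinct labels present -> proxy for "variety"
    classes_total: int       # labels the task requires -> "coverage" denominator

def gime_like_score(s: DatasetStats, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine normalized metric proxies into one selection score (illustrative)."""
    volume = min(s.num_records / 1_000_000, 1.0)     # saturate at 1M records
    sampling = min(s.records_per_hour / 100.0, 1.0)  # saturate at 100 records/hour
    variety = s.classes_seen / max(s.classes_total, 1)
    coverage = variety  # with label counts only, coverage collapses to variety here
    w_vol, w_samp, w_var, w_cov = weights
    return w_vol * volume + w_samp * sampling + w_var * variety + w_cov * coverage

# Rank candidate datasets and keep the best one for training.
candidates = {
    "logs_small": DatasetStats(50_000, 10.0, 8, 10),
    "logs_full": DatasetStats(2_000_000, 120.0, 10, 10),
}
best = max(candidates, key=lambda k: gime_like_score(candidates[k]))
```

In practice, the point of such a score is to rank or subsample data before training so that cheaper, higher-value subsets replace the full corpus; the paper's contribution is making those metric definitions principled and task-transferable rather than ad hoc as above.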