General Information Metrics for Improving AI Model Training Efficiency

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
The escalating scale of training data and the absence of universal data selection methods have made AI model training prohibitively expensive. Method: This paper proposes the General Information Metrics Evaluation (GIME) framework, grounded in Objective Information Theory (OIT). GIME introduces a systematic set of eleven universal information metrics, including volume, delay, granularity, and variety, to establish a cross-task, transferable paradigm for quantifying data value, integrating multidimensional information modeling with data-value-driven sampling. Results: Experiments across domains, including CTR prediction, civil case judgment forecasting, and weather forecasting, show that GIME substantially reduces training overhead while preserving model performance; in a judicial AI application, total training cost decreased by 39.56%. GIME thus provides a principled, scalable foundation for efficient, value-aware data curation in large-scale AI development.

📝 Abstract
To address the growing size of AI model training data and the lack of a universal data selection methodology, factors that significantly drive up training costs, this paper presents the General Information Metrics Evaluation (GIME) method. GIME leverages general information metrics from Objective Information Theory (OIT), including volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, coverage, distortion, and mismatch, to optimize dataset selection for training purposes. Comprehensive experiments conducted across diverse domains, such as CTR Prediction, Civil Case Prediction, and Weather Forecasting, demonstrate that GIME effectively preserves model performance while substantially reducing both training time and cost. Additionally, applying GIME within the Judicial AI Program led to a 39.56% reduction in total model training expenses, underscoring its potential to support efficient and sustainable AI development.
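The abstract describes scoring candidate datasets along metrics such as volume and variety before training. The paper's actual metric definitions are not given here, so the following is only a minimal sketch of the general idea: it uses record count as a stand-in for "volume" and Shannon entropy of a categorical field as a stand-in for "variety", then combines them into a toy data-value score. All function names and weights are hypothetical.

```python
import math
from collections import Counter

def volume(dataset):
    # Illustrative "volume" metric: number of records in the candidate subset.
    return len(dataset)

def variety(dataset, key):
    # Illustrative "variety" metric: Shannon entropy (in bits) of a
    # categorical field, capturing how diverse the subset is along it.
    counts = Counter(rec[key] for rec in dataset)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def data_value(dataset, key, w_volume=0.5, w_variety=0.5):
    # Toy aggregate score: weighted sum of a squashed volume term and variety.
    v = volume(dataset) / (volume(dataset) + 1.0)  # maps count into (0, 1)
    return w_volume * v + w_variety * variety(dataset, key)

records = [{"label": "civil"}, {"label": "civil"}, {"label": "criminal"}]
print(volume(records))                      # 3
print(round(variety(records, "label"), 3))  # 0.918
```

A selection procedure in this spirit would compute such scores for candidate subsets and train only on the highest-value ones; GIME's eleven metrics and their combination are defined in the paper itself.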
Problem

Research questions and friction points this paper is trying to address.

AI model training
data volume
cost efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

GIME
Data Selection
Cost Reduction
Jianfeng Xu
Huazhong University of Science and Technology
Precision Manufacturing, Microsystem, Dynamics
Congcong Liu
Institute for Smart Courts, Shanghai Jiao Tong University, Shanghai, 200030, China.
Xiaoying Tan
China Judicial Big Data Research Institute Co., Ltd., Beijing, 100035, China.
Xiaojie Zhu
Staff Research Scientist
Data Privacy, Applied Cryptography, Cybersecurity, Distributed Systems
Anpeng Wu
Zhejiang University
ML: Causal Learning, Representation Learning, Explainable AI
Huan Wan
iFLYTEK Co., Ltd., Hefei, 230088, China.
Weijun Kong
iFLYTEK Co., Ltd., Hefei, 230088, China.
Chun Li
MD Anderson Cancer Center
diagnostic imaging, drug delivery, nanotechnology
Hu Xu
Institute for Smart Courts, Shanghai Jiao Tong University, Shanghai, 200030, China.
Kun Kuang
Zhejiang University
Causal Inference, Data Mining, Machine Learning