Collaborative Prediction: To Join or To Disjoin Datasets

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the problem of selecting which datasets to fuse from multiple sources so as to minimize the overall prediction loss of a target model. It proposes a theoretically grounded dataset selection algorithm built on an oracle inequality, described as the first such fusion criterion applicable to simple predictive models such as linear regression. Combined with data-driven estimators, the criterion yields a provable reduction in loss with high probability. The method is validated in standard linear regression settings and in broader machine learning tasks, where it reduces overall prediction error. The implementation is publicly available as open-source software.

📝 Abstract
With the recent rise of generative Artificial Intelligence (AI), the need to select high-quality datasets to improve machine learning models has garnered increasing attention. However, parts of this topic remain underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate datasets to minimize the population loss of a prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model's performance, and propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader machine learning applications. Code is available at https://github.com/kkrokii/collaborative_prediction.
Problem

Research questions and friction points this paper is trying to address.

Selecting high-quality datasets to improve prediction models
Determining when to merge datasets for better model performance
Developing algorithms to minimize population loss with guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithm selects datasets to minimize loss
Leverages oracle inequality and data-driven estimators
Effective in linear regression and ML applications
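
The join-or-disjoin question can be illustrated with a small sketch. This is not the paper's algorithm: the oracle-inequality criterion and its estimators are not reproduced on this page, so the sketch swaps in a plain held-out comparison for linear regression. The function names (fit_ols, mse, should_join), the single-target setup, and the holdout_frac parameter are all assumptions made for illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least-squares coefficients (minimum-norm if underdetermined)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mse(coef, X, y):
    """Mean squared prediction error of a linear model on (X, y)."""
    return float(np.mean((X @ coef - y) ** 2))


def should_join(X_a, y_a, X_b, y_b, holdout_frac=0.3, seed=0):
    """Hypothetical helper: merge source B into target A only if doing so
    lowers the held-out loss on A. A stand-in for a principled criterion,
    not the method from the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_a))
    n_val = max(1, int(holdout_frac * len(y_a)))
    val, train = idx[:n_val], idx[n_val:]

    # Model trained on the target dataset alone.
    coef_alone = fit_ols(X_a[train], y_a[train])
    # Model trained on the target training split pooled with source B.
    coef_joint = fit_ols(
        np.vstack([X_a[train], X_b]),
        np.concatenate([y_a[train], y_b]),
    )

    # Both models are scored on the same held-out part of A.
    loss_alone = mse(coef_alone, X_a[val], y_a[val])
    loss_joint = mse(coef_joint, X_a[val], y_a[val])
    return loss_joint < loss_alone, loss_alone, loss_joint
```

Calling should_join(X_a, y_a, X_b, y_b) returns the decision together with both held-out losses. A held-out comparison of this kind is noisy when the target dataset is small, which is the regime where a criterion with high-probability guarantees, as described above, would matter most.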
Kyung Rok Kim, University of North Carolina at Chapel Hill
Yansong Wang, University of Science and Technology of China
Xiaocheng Li, Imperial College Business School, Imperial College London (machine learning, operations research)
Guanting Chen, University of North Carolina at Chapel Hill