🤖 AI Summary
This paper addresses the optimal selection problem in multi-source dataset fusion, aiming to minimize the overall prediction loss of a target model. The authors propose a theoretically grounded dataset selection algorithm that establishes the first oracle-inequality-based fusion criterion for simple predictive models, such as linear regression, enabling provable loss reduction with high probability via data-driven estimators. The method integrates statistical learning theory with practical dataset selection, combining theoretical rigor with empirical applicability. It is validated in standard linear regression settings and broader machine learning tasks, where it consistently achieves significant reductions in overall prediction error, and the implementation is publicly available as open-source software.
📝 Abstract
With the recent rise of generative Artificial Intelligence (AI), the need to select high-quality datasets for improving machine learning models has garnered increasing attention. However, parts of this topic remain underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate datasets to minimize the population loss of a prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model's performance, and we propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader machine learning applications. Code is available at https://github.com/kkrokii/collaborative_prediction.
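To make the merge-or-not question concrete, the sketch below illustrates a generic plug-in version of the idea for linear regression: each candidate source dataset is merged only if doing so lowers the estimated loss on a held-out split of the target data. This is a simplified stand-in for illustration, not the paper's oracle-inequality-based criterion; the function and variable names (`select_sources`, `estimated_loss`) are hypothetical and do not refer to the released code.

```python
import numpy as np

def estimated_loss(X_tr, y_tr, X_val, y_val):
    # Fit OLS by least squares; use validation MSE as a plug-in
    # estimate of the population loss.
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return np.mean((X_val @ beta - y_val) ** 2)

def select_sources(X_t, y_t, sources, seed=0):
    """Greedily merge each auxiliary source (X_s, y_s) only if it lowers
    the estimated loss on a held-out split of the target data.

    A generic plug-in heuristic for illustration only; the paper's
    criterion is derived from an oracle inequality instead.
    """
    rng = np.random.default_rng(seed)
    n = len(y_t)
    idx = rng.permutation(n)
    tr, val = idx[: n // 2], idx[n // 2:]
    X_tr, y_tr = X_t[tr], y_t[tr]
    X_val, y_val = X_t[val], y_t[val]

    best = estimated_loss(X_tr, y_tr, X_val, y_val)
    chosen = []
    for k, (X_s, y_s) in enumerate(sources):
        X_m = np.vstack([X_tr, X_s])
        y_m = np.concatenate([y_tr, y_s])
        loss = estimated_loss(X_m, y_m, X_val, y_val)
        if loss < best:          # keep the source only if it helps
            best = loss
            chosen.append(k)
            X_tr, y_tr = X_m, y_m
    return chosen, best
```

On synthetic data, a source drawn from the same linear model as the target tends to be kept, while a source generated from a very different model is rejected because it inflates the held-out error.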