🤖 AI Summary
This paper addresses the optimal selection problem in multi-source dataset fusion, aiming to minimize the overall prediction loss of a target model. The authors propose a theoretically grounded dataset selection algorithm that establishes the first oracle-inequality-based fusion criterion for simple predictive models, such as linear regression, enabling provable loss reduction with high probability via data-driven estimators. The method integrates statistical learning theory with practical dataset selection, combining theoretical rigor with empirical applicability. It is validated in standard linear regression settings and broader machine learning tasks, where it consistently achieves significant reductions in overall prediction error, and the implementation is publicly available as open-source software.
📝 Abstract
With the recent rise of generative Artificial Intelligence (AI), the need to select high-quality datasets for improving machine learning models has garnered increasing attention. However, parts of this topic remain underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate datasets to minimize the population loss of a prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model's performance, and we propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader machine learning applications. Code is available at https://github.com/kkrokii/collaborative_prediction.
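To make the merge-or-not question concrete, the sketch below illustrates a generic plug-in version of the idea for linear regression: each candidate source dataset is merged only if doing so lowers the estimated loss on a held-out split of the target data. This is a simplified stand-in for illustration, not the paper's oracle-inequality-based criterion; the function and variable names (`select_sources`, `estimated_loss`) are hypothetical and do not refer to the released code.

```python
import numpy as np

def estimated_loss(X_tr, y_tr, X_val, y_val):
    # Fit OLS by least squares; use validation MSE as a plug-in
    # estimate of the population loss.
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return np.mean((X_val @ beta - y_val) ** 2)

def select_sources(X_t, y_t, sources, seed=0):
    """Greedily merge each auxiliary source (X_s, y_s) only if it lowers
    the estimated loss on a held-out split of the target data.

    A generic plug-in heuristic for illustration only; the paper's
    criterion is derived from an oracle inequality instead.
    """
    rng = np.random.default_rng(seed)
    n = len(y_t)
    idx = rng.permutation(n)
    tr, val = idx[: n // 2], idx[n // 2:]
    X_tr, y_tr = X_t[tr], y_t[tr]
    X_val, y_val = X_t[val], y_t[val]

    best = estimated_loss(X_tr, y_tr, X_val, y_val)
    chosen = []
    for k, (X_s, y_s) in enumerate(sources):
        X_m = np.vstack([X_tr, X_s])
        y_m = np.concatenate([y_tr, y_s])
        loss = estimated_loss(X_m, y_m, X_val, y_val)
        if loss < best:          # keep the source only if it helps
            best = loss
            chosen.append(k)
            X_tr, y_tr = X_m, y_m
    return chosen, best
```

On synthetic data, a source drawn from the same linear model as the target tends to be kept, while a source generated from a very different model is rejected because it inflates the held-out error.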