Distinctiveness Maximization in Datasets Assemblage

📅 2024-01-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the NP-hard problem of selecting, under a budget constraint, a set of datasets from a data pool that adds the maximum number of distinct tuples to a base dataset for a given query set. The authors frame the core subproblem as cardinality estimation for a query set over a set of datasets, a task that prior single-query, single-dataset estimators do not cover. They design a machine learning-based estimator of the distinctiveness marginal gain, which avoids scanning every tuple of each candidate dataset, and use it inside a greedy selection algorithm. Experiments on five real-world data pools show that the method outperforms all relevant baselines in effectiveness, efficiency, and scalability. A case study on two downstream machine learning tasks suggests that the selected datasets contain more useful tuples and can improve task performance.

📝 Abstract

In this paper, given a user's query set and budget, we aim to use the limited budget to help users assemble a set of datasets that can enrich a base dataset by introducing the maximum number of distinct tuples (i.e., maximizing distinctiveness). We prove this problem to be NP-hard. A greedy algorithm using exact distinctiveness computation attains an approximation ratio of (1-1/e)/2, but it lacks efficiency and scalability due to its frequent computation of the exact distinctiveness marginal gain of any candidate dataset for selection. This requires scanning through every tuple in candidate datasets and thus is unaffordable in practice. To overcome this limitation, we propose an efficient machine learning (ML)-based method for estimating the distinctiveness marginal gain of any candidate dataset. This effectively eliminates the need to test each tuple individually. Estimating the distinctiveness marginal gain of a dataset involves estimating the number of distinct tuples in the tuple sets returned by each query in a query set across multiple datasets. This can be viewed as the cardinality estimation for a query set on a set of datasets, and the proposed method is the first to tackle this cardinality estimation problem. This is a significant advancement over prior methods that were limited to single-query cardinality estimation on a single dataset and struggled with identifying overlaps among tuple sets returned by each query in a query set across multiple datasets. Extensive experiments using five real-world data pools demonstrate that our algorithm, which utilizes ML-based distinctiveness estimation, outperforms all relevant baselines in effectiveness, efficiency, and scalability. A case study on two downstream ML tasks also highlights its potential to find datasets with more useful tuples to enhance the performance of ML tasks.
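The exact-computation greedy baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic cost-benefit greedy for budgeted selection, and the names `distinct_tuples` and `greedy_assemble`, the representation of queries as Python predicates over rows, the positive per-dataset costs, and the choice to count a row once if it matches any query are all assumptions made for illustration. It shows why exact marginal gains are costly: every candidate's tuples must be scanned and compared against what is already covered.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Row = Tuple[Hashable, ...]
Query = Callable[[Row], bool]  # hypothetical: a query is a row predicate


def distinct_tuples(datasets: List[List[Row]], queries: List[Query]) -> Set[Row]:
    """Exact set of distinct rows, across all datasets, matching any query."""
    return {row for data in datasets for row in data
            if any(q(row) for q in queries)}


def greedy_assemble(
    base: List[Row],
    pool: Dict[str, List[Row]],
    costs: Dict[str, float],  # assumed strictly positive
    queries: List[Query],
    budget: float,
) -> Tuple[List[str], int]:
    """Pick datasets by exact marginal gain per unit cost until the budget is spent."""
    covered = distinct_tuples([base], queries)
    # Exact computation needs a full scan of every candidate dataset.
    matches = {name: distinct_tuples([rows], queries) for name, rows in pool.items()}
    chosen: List[str] = []
    spent = 0.0
    remaining = set(pool)
    while remaining:
        best, best_ratio = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[name] > budget:
                continue  # candidate does not fit in the remaining budget
            gain = len(matches[name] - covered)  # exact marginal gain
            ratio = gain / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:  # nothing affordable adds a new distinct tuple
            break
        chosen.append(best)
        spent += costs[best]
        covered |= matches[best]
        remaining.remove(best)
    return chosen, len(covered)
```

In this sketch the set difference `matches[name] - covered` is recomputed for every remaining candidate in every round; this is the step that the ML-based estimator described in the abstract is meant to replace with a prediction.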

Problem

Research questions and friction points this paper is trying to address.

Maximize distinct tuples in assembled datasets

Efficient ML-based distinctiveness marginal gain estimation

Cardinality estimation for query sets across datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

ML-based distinctiveness estimation

Efficient cardinality estimation

Scalable dataset assemblage algorithm
