Generating Skyline Datasets for Data Science Models

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Conventional data discovery methods integrate datasets toward a single predefined quality measure, which can introduce bias into downstream AI models. Method: This paper proposes a skyline-based dataset generation framework that treats data discovery as multi-objective optimization, constructing Pareto-optimal datasets by jointly optimizing user-specified model performance measures (for example accuracy, robustness, and generalizability). Contribution/Results: The paper introduces MODis, formulated as a multi-goal finite state transducer, and derives three algorithms: a "reduce-from-universal" strategy that starts from a universal schema and iteratively prunes unpromising data, a bi-directional strategy that interleaves data augmentation and reduction to lower cost, and a diversification algorithm that mitigates bias in the resulting skyline datasets. Experiments verify the efficiency and effectiveness of these algorithms and showcase their use in optimizing data science pipelines.
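To make the notion of a Pareto-optimal (skyline) set concrete, here is a minimal Python sketch of the dominance test and skyline filter. It is an illustration only, not the MODis implementation: the candidate names `D1` to `D4`, the two measures, and the scores are made up, and every measure is assumed to be "higher is better".

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one value per performance measure, higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every measure and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, score in candidates.items()
        if not any(
            dominates(other_score, score)
            for other_name, other_score in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical candidate datasets scored on (accuracy, robustness).
scores = {
    "D1": (0.90, 0.60),
    "D2": (0.85, 0.80),
    "D3": (0.80, 0.75),  # dominated by D2 on both measures
    "D4": (0.70, 0.90),
}
print(skyline(scores))  # ['D1', 'D2', 'D4']
```

No single dataset wins on both measures here, so the skyline keeps the three incomparable trade-offs and drops only `D3`.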

📝 Abstract
Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets towards a single pre-defined quality measure, which may lead to bias in downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined model-performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to achieve the desired performance on all of the performance measures. We formulate MODis as a multi-goal finite state transducer, and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate the bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms, and showcase their applications in optimizing data science pipelines.
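The "reduce-from-universal" idea can be sketched as a toy search over subsets of data sources. This is a loose illustration under stated assumptions, not the algorithm from the paper: the source names, the additive integer scores in `ACC` and `FAIR`, the `evaluate` function (a stand-in for training and testing a real model), and the heuristic "do not expand a state that its parent dominates" are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Scores = Tuple[int, ...]  # one value per measure, higher is better
State = FrozenSet[str]    # a set of selected data sources


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def reduce_from_universal(
    sources: Iterable[str], evaluate: Callable[[State], Scores]
) -> Dict[State, Scores]:
    """Start from all sources, drop one source at a time, return the skyline of visited states."""
    start = frozenset(sources)
    seen: Dict[State, Scores] = {start: evaluate(start)}
    frontier = [start]
    while frontier:
        next_frontier = []
        for state in frontier:
            for source in state:
                child = state - {source}
                if not child or child in seen:
                    continue
                seen[child] = evaluate(child)
                # Heuristic pruning (an assumption): skip expanding a child
                # that is worse than its parent on every measure.
                if not dominates(seen[state], seen[child]):
                    next_frontier.append(child)
        frontier = next_frontier
    # Keep only the visited states that no other visited state dominates.
    return {
        state: score
        for state, score in seen.items()
        if not any(dominates(other, score) for other in seen.values())
    }


# Hypothetical per-source contributions (in percentage points) to two measures.
ACC = {"A": 50, "B": 30, "C": -20}
FAIR = {"A": -10, "B": 40, "C": -10}


def evaluate(state: State) -> Scores:
    # Stand-in for "integrate the sources, train the model, measure it".
    return (sum(ACC[s] for s in state), sum(FAIR[s] for s in state))


result = reduce_from_universal(["A", "B", "C"], evaluate)
for state, score in sorted(result.items(), key=lambda item: sorted(item[0])):
    print(sorted(state), score)
```

With these toy numbers the search returns two skyline states: `{A, B}` with scores `(80, 30)` and `{B}` with scores `(30, 40)`. The pruning rule is a heuristic and can miss skyline states in general, and a real `evaluate` would be far more expensive, which is why reducing the number of evaluated states matters.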
Problem

Research questions and friction points this paper is trying to address.

Optimizes multiple model-performance measures
Generates skyline datasets for AI models
Reduces bias in data-driven analyses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes multiple user-defined measures
Uses multi-goal finite state transducer
Incorporates diversification to reduce bias