Hierarchical Dataset Selection for High-Quality Data Sharing

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-source, heterogeneous data sharing scenarios, efficiently selecting high-value datasets to enhance downstream model performance remains challenging. Method: This paper formally defines the “dataset-level selection” task and proposes a two-tier utility modeling framework that jointly captures heterogeneity across both datasets and data sources (e.g., institutions, domains), enabling few-shot generalization and adaptive selection under resource constraints. We introduce Dataset Selection via Hierarchies (DaSH), integrating hierarchical Bayesian modeling with utility propagation to jointly optimize active exploration and decision-making. Results: Experiments on Digit-Five and DomainNet demonstrate up to a 26.2% accuracy improvement over baselines, significantly reduced exploration steps, and strong robustness under low-resource conditions and critical data absence—establishing a new paradigm for principled, scalable dataset selection in heterogeneous federated settings.

📝 Abstract
The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training, are therefore critical decisions. Yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both the dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and to the absence of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.
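To make the two-level idea concrete, here is a minimal illustrative sketch of hierarchical utility modeling with Thompson-sampling exploration. This is not the paper's DaSH algorithm; it is a simplified stand-in under assumed details: dataset-level utility estimates are shrunk toward their group's mean (empirical-Bayes style), so a few observations on one dataset inform its sibling datasets, and exploration picks the dataset with the highest sampled utility. All names (`HierarchicalSelector`, `prior_strength`, the group/dataset labels) are hypothetical.

```python
import random
import statistics

class HierarchicalSelector:
    """Two-level utility model: group-level means act as priors that
    shrink per-dataset estimates, enabling generalization from few
    observations. Selection uses Thompson sampling over all datasets."""

    def __init__(self, groups, prior_mean=0.5, prior_strength=2.0):
        # groups: {group_name: [dataset_name, ...]}
        self.groups = groups
        self.prior_mean = prior_mean        # global prior before any data
        self.prior_strength = prior_strength  # pseudo-count pulling toward group mean
        self.obs = {d: [] for ds in groups.values() for d in ds}

    def _group_mean(self, g):
        # Pool all observations within a group; fall back to the global prior.
        vals = [v for d in self.groups[g] for v in self.obs[d]]
        return statistics.mean(vals) if vals else self.prior_mean

    def _posterior(self, g, d):
        # Shrink the dataset's empirical mean toward its group mean.
        n = len(self.obs[d])
        g_mean = self._group_mean(g)
        d_mean = statistics.mean(self.obs[d]) if n else g_mean
        k = self.prior_strength
        mean = (n * d_mean + k * g_mean) / (n + k)
        std = 1.0 / (1.0 + n) ** 0.5  # uncertainty shrinks with observations
        return mean, std

    def select(self, rng=random):
        # Thompson sampling: draw one utility per dataset, pick the max draw.
        best, best_draw = None, float("-inf")
        for g, datasets in self.groups.items():
            for d in datasets:
                mean, std = self._posterior(g, d)
                draw = rng.gauss(mean, std)
                if draw > best_draw:
                    best, best_draw = d, draw
        return best

    def update(self, dataset, utility):
        # Record the observed downstream utility (e.g., validation gain).
        self.obs[dataset].append(utility)
```

In use, the selector quickly concentrates exploration on the high-utility group: once one of a group's datasets proves useful, its siblings inherit a higher prior, which mirrors the paper's motivation for modeling utility at the source level rather than treating every dataset independently.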
Problem

Research questions and friction points this paper is trying to address.

Selecting high-quality datasets from heterogeneous sources
Modeling utility at dataset and group levels for efficiency
Improving downstream performance under resource constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical modeling of dataset and group utility
Efficient generalization from limited dataset observations
Robust selection in low-resource and sparse settings