DataS^3: Dataset Subset Selection for Specialization

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the performance degradation machine learning models suffer in specific deployment environments (e.g., a particular hospital or national park) due to distributional shift. The authors formalize dataset subset selection for specialization (DS3): selecting a subset of a general training set that maximizes model performance on a target deployment distribution. They introduce DataS^3, the first cross-domain, multi-scenario benchmark for DS3, and empirically demonstrate that mainstream data selection methods systematically fail on this task. Their study evaluates algorithm families including coresets, data filtering, and data curation, several of which can operate even without labeled deployment data. Experiments show that expert-curated subsets yield average accuracy gains of 20.7%, with peaks up to 51.3%, significantly outperforming full-dataset training and existing selection baselines, while also improving training efficiency and generalization robustness across diverse domains.

📝 Abstract
In many real-world machine learning (ML) applications (e.g., detecting broken bones in x-ray images, detecting species in camera traps), models need to perform well on specific deployments (e.g., a specific hospital, a specific national park) rather than the domain broadly. However, deployments often have imbalanced, unique data distributions. Discrepancy between the training distribution and the deployment distribution can lead to suboptimal performance, highlighting the need to select deployment-specialized subsets from the available training data. We formalize dataset subset selection for specialization (DS3): given a training set drawn from a general distribution and a (potentially unlabeled) query set drawn from the desired deployment-specific distribution, the goal is to select a subset of the training data that optimizes deployment performance. We introduce DataS^3, the first dataset and benchmark designed specifically for the DS3 problem. DataS^3 encompasses diverse real-world application domains, each with a set of distinct deployments to specialize in. We conduct a comprehensive study evaluating algorithms from various families, including coresets, data filtering, and data curation, on DataS^3, and find that general-distribution methods consistently fail on deployment-specific tasks. Additionally, we demonstrate the existence of manually curated (deployment-specific) expert subsets that outperform training on all available data, with accuracy gains up to 51.3 percent. Our benchmark highlights the critical role of tailored dataset curation in enhancing performance and training efficiency on deployment-specific distributions, which we posit will only become more important as global, public datasets become available across domains and ML models are deployed in the real world.
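The abstract's query-set formulation can be illustrated with a minimal sketch: score each training example by its similarity to an unlabeled deployment query set in some feature space, then keep the top-k as the specialized subset. This is a hypothetical illustration of the DS3 setup (the function name and the mean-cosine-similarity heuristic are assumptions, not the paper's benchmarked methods, which include coresets, data filtering, and expert curation):

```python
import numpy as np

def select_deployment_subset(train_feats, query_feats, k):
    """Hypothetical DS3-style selection: rank training examples by mean
    cosine similarity to an unlabeled deployment query set, keep top-k."""
    # L2-normalize rows so dot products are cosine similarities
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    # score[i] = similarity of training example i to the mean query direction
    scores = t @ q.mean(axis=0)
    # indices of the k highest-scoring training examples
    return np.argsort(scores)[::-1][:k]

# Toy usage: training pool mixes two clusters; the deployment query set
# is drawn near the second cluster, so selection should favor it.
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 1, (50, 8)),   # cluster A (indices 0-49)
                   rng.normal(5, 1, (50, 8))])  # cluster B (indices 50-99)
query = rng.normal(5, 1, (20, 8))               # deployment looks like B
idx = select_deployment_subset(train, query, k=30)
```

A subset selected this way would then be used to train (or fine-tune) the deployment model; in the benchmark's setting the query set carries no labels, which is why similarity-based heuristics like this are applicable at all.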
Problem

Research questions and friction points this paper is trying to address.

Selects training subsets to optimize deployment-specific ML performance
Addresses imbalanced data distributions in specialized real-world applications
Benchmarks methods for deployment-focused dataset curation and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selects deployment-specialized subsets from training data
Introduces DataS^3 dataset and benchmark for DS3
Demonstrates expert subsets outperform full training data
👥 Authors
Neha Hulkund (Massachusetts Institute of Technology): Artificial Intelligence
Alaa Maalouf (CSAIL, MIT): Machine Learning, Deep Learning, Vision, Robotics, Big Data
Levi Cai (MIT and WHOI): field robotics, marine robotics, multi-robot systems, reinforcement learning, construction robotics
Daniel Yang (MIT, Woods Hole Oceanographic Institution)
Tsun-Hsuan Wang (Massachusetts Institute of Technology): robotics, machine learning, simulation
Timm Haucke (Massachusetts Institute of Technology): wildlife monitoring, machine learning, computer vision
Sandeep Mukherjee (UC Berkeley)
Vikram V. Ramaswamy (Lecturer, Princeton University): Computer Vision, Fairness in AI, Explainable AI
Judy Hanwen Shen (Stanford University): Algorithmic Fairness, Differential Privacy, Alignment
Gabriel Tseng (McGill University, Mila - Quebec AI Institute)
Mike Walmsley (Postdoctoral Researcher, University of Manchester): Deep learning, Citizen Science, Galaxy morphology, galaxy evolution, mergers, tidal features
Daniela Rus (Andrew (1956) and Erna Viterbi Professor of Computer Science, MIT): Robotics, Wireless Networks, Distributed Computing
Ken Goldberg (Professor, UC Berkeley and UCSF): Robots, Robotics, Automation, Collaborative Filtering
Hannah Kerner (Arizona State University)
Irene Y. Chen (Assistant Professor, UC Berkeley and UC San Francisco): machine learning, healthcare, equity, precision health
Yogesh A. Girdhar (Woods Hole Oceanographic Institution)
Sara Beery (Assistant Professor at MIT CSAIL): Computer Vision, Conservation Technology, Computational Sustainability, Camera Trapping