🤖 AI Summary
This work addresses training data purification under label noise. We propose a surrogate-model-driven framework, Black-Box Optimization–Post-processing–Quantum Annealing (BBO-Post-QA), that combines surrogate modeling, iterative black-box optimization, and quantum annealing for subset selection. To our knowledge, this is the first application of a physical quantum annealer (D-Wave) to training set purification: the surrogate model estimates the validation loss of candidate training subsets; black-box optimization with postprocessing iteratively refines those estimates; and the D-Wave clique sampler draws diverse subsets with low validation error. Experiments on a noisy majority bit task show that the method prioritizes the removal of high-risk mislabeled instances. Compared with OpenJij's simulated quantum annealing sampler and Neal's simulated annealing sampler, the D-Wave hardware implementation achieves faster optimization and higher-quality training subsets, supporting quantum annealing as a practical tool for data cleaning.
📝 Abstract
This study proposes an approach for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances, a common issue in real-world datasets, often degrade model generalization, necessitating robust and efficient noise-removal strategies. The proposed method evaluates filtered training subsets based on validation loss, iteratively refines loss estimates through surrogate model-based BBO with postprocessing, and leverages quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task demonstrate the method's ability to prioritize the removal of high-risk mislabeled instances. Integrating D-Wave's clique sampler running on a physical quantum annealer achieves faster optimization and higher-quality training subsets compared to OpenJij's simulated quantum annealing sampler or Neal's simulated annealing sampler, offering a scalable framework for enhancing dataset quality. This work highlights the effectiveness of the proposed method for supervised learning tasks, with future directions including its application to unsupervised learning, real-world datasets, and large-scale implementations.
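To make the objective concrete, the following is a minimal sketch of the black-box function the abstract describes: a binary mask selects which training instances to keep, and the score is the validation loss of a model trained on the kept subset. Everything beyond that is an assumption made for illustration: the 1-nearest-neighbour learner, the clean validation set, the dataset sizes, and all function names are hypothetical. The search step here is a plain greedy bit-flip descent, which stands in for the surrogate-based BBO and the annealing samplers and is not the method used in the paper.

```python
import random

def majority(bits):
    """Label is 1 when more than half of the bits are 1."""
    return int(2 * sum(bits) > len(bits))

def make_data(n, n_bits, flip_prob, rng):
    """Random bit strings with majority labels; each label flips with flip_prob."""
    xs = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n)]
    ys = [majority(x) ^ int(rng.random() < flip_prob) for x in xs]
    return xs, ys

def validation_loss(mask, train, valid):
    """Black-box objective: 1-NN error rate on the validation set,
    using only the training instances whose mask bit is 1."""
    (xt, yt), (xv, yv) = train, valid
    kept = [(x, y) for keep, x, y in zip(mask, xt, yt) if keep]
    if not kept:
        return 1.0  # assumed convention: an empty subset scores worst
    errors = 0
    for x, y in zip(xv, yv):
        # Nearest kept instance by Hamming distance (first one wins ties).
        _, pred = min(kept, key=lambda p: sum(a != b for a, b in zip(p[0], x)))
        errors += pred != y
    return errors / len(xv)

def greedy_filter(train, valid, sweeps=2):
    """Stand-in search: flip one mask bit at a time, keep strict improvements."""
    n = len(train[0])
    mask = [1] * n
    best = validation_loss(mask, train, valid)
    for _ in range(sweeps):
        for i in range(n):
            mask[i] ^= 1
            loss = validation_loss(mask, train, valid)
            if loss < best:
                best = loss
            else:
                mask[i] ^= 1  # revert the flip
    return mask, best

rng = random.Random(0)
train = make_data(40, 7, 0.3, rng)   # 30% of training labels flipped (assumed rate)
valid = make_data(60, 7, 0.0, rng)   # clean validation set (assumption)

full_loss = validation_loss([1] * 40, train, valid)
mask, filtered_loss = greedy_filter(train, valid)
print(f"all instances: {full_loss:.3f}  filtered: {filtered_loss:.3f}  kept: {sum(mask)}/40")
```

In the approach described above, `validation_loss` would be treated as expensive: a surrogate model would be fitted to evaluated (mask, loss) pairs, and an annealing sampler would then propose new masks from the surrogate rather than flipping bits one at a time.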