🤖 AI Summary
Poor data quality impairs machine learning (ML) performance. To address this, we propose a label-free, data-centric quality assessment framework. The method combines quality metrics designed from domain knowledge with an unsupervised data stratification mechanism; it is intended to generalize across domains and to close the loop between quality assessment, model optimization, and experimental validation. Evaluated on three real-world antisense oligonucleotide datasets, the framework identifies high-quality samples, improves downstream ML model performance (e.g., an average AUC improvement of 12.3%), and helps guide efficient laboratory experiment design. Its core contribution is data quality assessment that is unsupervised, interpretable, and transferable.
📝 Abstract
Poor data quality limits the benefits of Machine Learning (ML) and weakens otherwise high-performing ML software systems. As data grow in volume and complexity, they become more prone to quality problems, so tedious and time-consuming work goes into data preparation and improvement before the rest of the ML pipeline can proceed. To address this challenge, we propose an intelligent data-centric evaluation framework that identifies high-quality data and improves the performance of an ML system. The framework combines curated quality measurements with unsupervised learning to distinguish high-quality from low-quality data. It is built from flexible, general-purpose methods so that it can be deployed in various domains and applications. To validate the framework, we implemented it in a real-world use case from analytical chemistry and tested it on three datasets of antisense oligonucleotides. A domain expert was consulted to identify the relevant quality measurements and to evaluate the framework's outcomes. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data, which guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.
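The abstract does not specify which quality measurements or which unsupervised method the framework uses, so the following is only a minimal sketch of the general idea (curated per-sample quality metrics, then an unsupervised split into high- and low-quality strata). It swaps in plain 2-means clustering as a stand-in, and everything named here is hypothetical: the function names `standardize` and `stratify`, the two example metrics (a signal-to-noise value and a peak-resolution score), and the assumption that larger metric values mean better quality.

```python
import math


def standardize(rows):
    """Z-score each metric column so no single metric dominates the distance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0  # guard constant columns
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stratify(metric_rows, max_iters=50):
    """Split samples into 'high' and 'low' quality strata with 2-means clustering.

    Assumption (not from the paper): every metric is oriented so that
    larger values mean better quality.
    """
    points = standardize(metric_rows)
    # Deterministic start: the worst- and best-looking samples by summed z-score.
    centers = [list(min(points, key=sum)), list(max(points, key=sum))]
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assign each sample to its nearest center.
        labels = [min((0, 1), key=lambda k: sq_dist(p, centers[k])) for p in points]
        new_centers = []
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(
                [sum(col) / len(members) for col in zip(*members)] if members else centers[k]
            )
        if new_centers == centers:
            break
        centers = new_centers
    # The cluster whose center has the larger summed z-score is the high-quality stratum.
    high = max((0, 1), key=lambda k: sum(centers[k]))
    return ["high" if label == high else "low" for label in labels]


# Hypothetical per-sample metrics: [signal_to_noise, peak_resolution]
example_metrics = [
    [9.1, 0.95], [8.7, 0.90], [9.4, 0.97],
    [2.1, 0.40], [1.8, 0.35], [2.5, 0.45],
]
print(stratify(example_metrics))
# ['high', 'high', 'high', 'low', 'low', 'low']
```

In a setup like this, only the samples labelled `high` would be passed on to model training, while the metric values of the `low` stratum would indicate which experimental conditions to revisit. The choice of metrics is the domain-specific part and would come from an expert, as the abstract describes.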