Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

📅 2025-02-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Poor data quality can severely impair machine learning (ML) performance. To address this, we propose a label-free, data-centric quality assessment framework. The method combines quality metrics designed from domain knowledge with an unsupervised data stratification mechanism, which is intended to generalize across domains and to close the loop between quality assessment, model optimization, and experimental validation. Evaluated on three real-world antisense oligonucleotide datasets, the framework identifies high-quality samples, improves downstream ML model performance (e.g., an average AUC improvement of 12.3%), and guides the design of efficient laboratory experiments. Its core contribution is a data quality assessment that is unsupervised, interpretable, and transferable.

📝 Abstract

Poor data quality limits the power of Machine Learning (ML) and weakens otherwise high-performing ML software systems. Data are increasingly prone to poor quality because of their growing volume and complexity, so tedious and time-consuming work goes into data preparation and improvement before the rest of the ML pipeline can proceed. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines the curation of quality measurements with unsupervised learning to distinguish high-quality from low-quality data. It is designed around flexible, general-purpose methods so that it can be deployed in various domains and applications. To validate the framework, we implemented it in a real-world use case from the field of analytical chemistry and tested it on three datasets of antisense oligonucleotides. A domain expert was consulted to identify the relevant quality measurements and to evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data, which guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.
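
The abstract does not name the unsupervised method, so the following is only a minimal sketch of the general idea. It assumes each sample has already been scored on a few quality measurements (the two columns here, a signal-to-noise ratio and a completeness fraction, are hypothetical), that higher values mean better quality, and that a plain two-cluster k-means is an acceptable stand-in for the grouping step. All names and values are illustrative and not taken from the paper.

```python
import math

def standardize(rows):
    """Z-score each column so no single measurement dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant (standard deviation of zero).
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def two_means(points, iters=50):
    """K-means with k=2, seeded from the rows with the lowest and highest sums."""
    order = sorted(range(len(points)), key=lambda i: sum(points[i]))
    centers = [list(points[order[0]]), list(points[order[-1]])]
    labels = None
    for _ in range(iters):
        new_labels = [
            0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def stratify(scores):
    """Label each sample 'high' or 'low' (assumption: higher score = better)."""
    labels, centers = two_means(standardize(scores))
    high = 0 if sum(centers[0]) > sum(centers[1]) else 1
    return ["high" if lab == high else "low" for lab in labels]

# Hypothetical (signal_to_noise, completeness) scores for six samples.
scores = [(9.1, 0.98), (8.7, 0.95), (9.4, 0.99),
          (3.2, 0.60), (2.8, 0.55), (3.5, 0.62)]
quality = stratify(scores)
print(quality)  # ['high', 'high', 'high', 'low', 'low', 'low']
```

In a setting like the one described, the samples labeled `high` would be the ones passed on to model training, and a domain expert would check whether the split makes sense.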

Problem

Research questions and friction points this paper is trying to address.

Enhances ML performance via data quality assessment

Identifies high-quality data using unsupervised learning

Improves ML system efficiency in various domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised learning for data quality

Flexible general-purpose evaluation framework

Domain-specific quality measurement curation
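
One way to picture domain-specific measurement curation inside a general-purpose framework is as a set of pluggable scoring functions selected by the domain expert. The sketch below is an assumption about how such a design might look, not the paper's implementation. The record fields (`peak_area`, `noise`, `retention_time`) and both metrics are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[float]]
Metric = Callable[[Record], float]

def completeness(rec: Record) -> float:
    """Fraction of fields in the record that are present (not None)."""
    if not rec:
        return 0.0
    return sum(v is not None for v in rec.values()) / len(rec)

def signal_to_noise(rec: Record) -> float:
    """Hypothetical ratio; 0.0 when a field is missing or noise is zero."""
    area, noise = rec.get("peak_area"), rec.get("noise")
    if area is None or not noise:
        return 0.0
    return area / noise

def score_records(records: List[Record],
                  metrics: Dict[str, Metric]) -> List[Dict[str, float]]:
    """Apply every curated metric to every record."""
    return [{name: fn(rec) for name, fn in metrics.items()} for rec in records]

# The expert curates the metric set; swapping domains means swapping this dict.
metrics = {"completeness": completeness, "signal_to_noise": signal_to_noise}
records = [
    {"peak_area": 120.0, "noise": 4.0, "retention_time": 6.2},
    {"peak_area": 30.0, "noise": None, "retention_time": 5.9},
]
table = score_records(records, metrics)
print(table)
```

Keeping the metrics in a plain dictionary separates the domain knowledge (which measurements matter) from the generic scoring loop, which is one reading of the "flexible general-purpose" claim.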
