Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

📅 2025-02-18
🤖 AI Summary
Degraded data quality impairs machine learning (ML) performance. To address this, we propose a label-free, data-centric quality assessment framework. The method combines domain-knowledge-driven quality metric design with an unsupervised data stratification mechanism, aiming at cross-domain generalization and a closed loop of quality assessment, model optimization, and experimental validation. Evaluated on three real-world antisense oligonucleotide datasets, the framework identifies high-quality samples and improves downstream ML model performance (e.g., an average AUC improvement of 12.3%), and it helps guide efficient laboratory experiment design. Its core contribution is unsupervised, interpretable, and transferable data quality assessment.

📝 Abstract
Poor data quality limits the benefits of Machine Learning (ML) and weakens otherwise high-performing ML software systems. Data are increasingly prone to poor quality because of their growing volume and complexity. As a result, tedious and time-consuming work goes into data preparation and improvement before the rest of the ML pipeline can proceed. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines the curation of quality measurements with unsupervised learning to distinguish high- and low-quality data. The framework is designed to integrate flexible and general-purpose methods so that it can be deployed in various domains and applications. To validate the outcomes of the designed framework, we implemented it in a real-world use case from the field of analytical chemistry, where it was tested on three datasets of antisense oligonucleotides. A domain expert was consulted to identify the relevant quality measurements and evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data, which guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.
Problem

Research questions and friction points this paper is trying to address.

Enhances ML performance via data quality assessment
Identifies high-quality data using unsupervised learning
Improves ML system efficiency in various domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised learning for data quality
Flexible general-purpose evaluation framework
Domain-specific quality measurement curation
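
The combination of curated quality measurements and unsupervised learning described in the abstract can be illustrated with a minimal sketch. The page does not say which clustering method the authors use, so everything here is an assumption: the function name `stratify_by_quality` is hypothetical, the clustering is a plain two-cluster k-means written in NumPy, and each column is a hypothetical quality measurement for which higher values mean better quality.

```python
import numpy as np


def stratify_by_quality(metrics, n_iter=50):
    """Split samples into high- and low-quality groups (illustrative only).

    metrics: array-like of shape (n_samples, n_metrics). Each column is a
    hypothetical quality measurement where a higher value is assumed better.
    Returns a boolean mask that is True for the higher-quality cluster.
    """
    X = np.asarray(metrics, dtype=float)

    # Standardize each metric so no single measurement dominates the distance.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std

    # Deterministic start: the samples with the lowest and highest summed score.
    score = Z.sum(axis=1)
    centers = np.stack([Z[score.argmin()], Z[score.argmax()]])

    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        # Assign every sample to its nearest center.
        dists = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.stack([
            Z[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(2)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Assumption: the cluster whose center has the larger summed metric
    # is the high-quality one.
    high = int(centers.sum(axis=1).argmax())
    return labels == high


# Toy data: two made-up quality metrics for six samples.
toy_metrics = [
    [0.90, 0.80], [0.95, 0.85], [0.88, 0.90],   # plausibly high quality
    [0.20, 0.10], [0.15, 0.20], [0.10, 0.25],   # plausibly low quality
]
high_quality_mask = stratify_by_quality(toy_metrics)
```

In a setting like the one the abstract describes, the columns would be the measurements selected with the domain expert, and only the samples flagged by the mask would be passed on to train the downstream ML model.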
Manal Rahal
Department of Mathematics and Computer Science, Karlstad University, Universitetsgatan 2, Karlstad, 65188, Sweden
Bestoun S. Ahmed
Professor in Computer Science, Karlstad University
Software Testing, Software Engineering, SE4AI, MLOps
Gergely Szabados
Department of Engineering and Chemical Sciences, Karlstad, Universitetsgatan 2, Karlstad, 65188, Sweden
T. Fornstedt
Department of Engineering and Chemical Sciences, Karlstad, Universitetsgatan 2, Karlstad, 65188, Sweden
Jörgen Samuelsson
Department of Engineering and Chemical Sciences, Karlstad, Universitetsgatan 2, Karlstad, 65188, Sweden