DataLens: ML-Oriented Interactive Tabular Data Quality Dashboard

📅 2025-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing data management tools suffer from limited automation, poor interactivity, and insufficient integration with ML workflows, which compromises data quality and hinders analytical and modeling performance. To address this, the authors propose an interactive, ML-oriented tabular data quality dashboard that establishes an adaptive closed loop for data cleaning, combining human-in-the-loop guidance with ML-driven automation. The approach integrates data profiling with multi-strategy error detection and repair (statistical analysis, rule-based engines, and supervised/semi-supervised models), and supports expert rule validation and labeling. Cleaning strategies are iteratively refined using feedback from downstream model performance. The system also combines DataSheets, MLflow, and Delta Lake to ensure reproducibility, traceability, and versioning of the cleaning pipeline. Experiments across multiple benchmark datasets are reported to show improved error identification rate and repair accuracy, an average 7.2% accuracy gain for downstream ML models, and a 40% reduction in cleaning time.

📝 Abstract
Maintaining high data quality is crucial for reliable data analysis and machine learning (ML). However, existing data quality management tools often lack automation, interactivity, and integration with ML workflows. This demonstration paper introduces DataLens, a novel interactive dashboard designed to streamline and automate the data quality management process for tabular data. DataLens integrates a suite of data profiling, error detection, and repair tools, including statistical, rule-based, and ML-based methods. It features a user-in-the-loop module for interactive rule validation, data labeling, and custom rule definition, enabling domain experts to guide the cleaning process. Furthermore, DataLens implements an iterative cleaning module that automatically selects optimal cleaning tools based on downstream ML model performance. To ensure reproducibility, DataLens generates DataSheets capturing essential metadata and integrates with MLflow and Delta Lake for experiment tracking and data version control. This demonstration showcases DataLens's capabilities in effectively identifying and correcting data errors, improving data quality for downstream tasks, and promoting reproducibility in data cleaning pipelines.
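The iterative cleaning idea in the abstract (pick the cleaning tool that gives the best downstream model performance) can be illustrated with a minimal sketch. This is not DataLens code: the function names (`drop_missing`, `impute_mean`, `impute_median`, `select_cleaner`), the one-feature toy data, and the threshold classifier standing in for the downstream ML model are all assumptions made for illustration.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Optional, Tuple

# One row = (feature value, or None if missing; binary label). Hypothetical toy schema.
Row = Tuple[Optional[float], int]
Cleaner = Callable[[List[Row]], List[Row]]


def drop_missing(rows: List[Row]) -> List[Row]:
    """Candidate cleaner: discard rows whose feature is missing."""
    return [(x, y) for x, y in rows if x is not None]


def _impute(rows: List[Row], fill: float) -> List[Row]:
    """Replace missing feature values with a constant."""
    return [(fill if x is None else x, y) for x, y in rows]


def impute_mean(rows: List[Row]) -> List[Row]:
    """Candidate cleaner: fill missing values with the mean of the known ones."""
    return _impute(rows, mean(x for x, _ in rows if x is not None))


def impute_median(rows: List[Row]) -> List[Row]:
    """Candidate cleaner: fill missing values with the median of the known ones."""
    return _impute(rows, median(x for x, _ in rows if x is not None))


def fit_threshold(rows: List[Row]) -> float:
    """Toy downstream model: threshold halfway between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    return (mean(zeros) + mean(ones)) / 2


def accuracy(threshold: float, rows: List[Row]) -> float:
    """Fraction of rows where 'x > threshold' matches the label."""
    hits = sum(int(x > threshold) == y for x, y in rows)
    return hits / len(rows)


def select_cleaner(
    train: List[Row], validation: List[Row], cleaners: Dict[str, Cleaner]
) -> Tuple[str, Dict[str, float]]:
    """Clean the training data with each candidate, train the toy model,
    and return the candidate with the highest validation accuracy."""
    scores = {
        name: accuracy(fit_threshold(clean(train)), validation)
        for name, clean in cleaners.items()
    }
    return max(scores, key=scores.get), scores


# Dirty training data: two class-0 rows have a missing feature value.
train = [(1.0, 0), (2.0, 0), (None, 0), (None, 0), (3.0, 0),
         (8.0, 1), (9.0, 1), (10.0, 1)]
# Clean held-out data used only to score each cleaning strategy.
validation = [(1.5, 0), (2.5, 0), (4.0, 0), (6.0, 1), (7.5, 1), (8.5, 1)]

best_name, scores = select_cleaner(
    train,
    validation,
    {"drop_missing": drop_missing,
     "impute_mean": impute_mean,
     "impute_median": impute_median},
)
print(best_name, scores)
```

In this toy setup the selection is a single pass over a fixed set of candidates. A real system would presumably search over combinations of detectors and repair tools and use a proper model and evaluation protocol, but the feedback signal is the same: the downstream score decides which cleaning strategy is kept.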
Problem

Research questions and friction points this paper is trying to address.

Data Management
Machine Learning Integration
Data Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive Tool
Machine Learning Integration
Data Quality Control
Mohamed Abdelaal
Software AG, Darmstadt, Germany
Samuel Lokadjaja
TU Darmstadt, Darmstadt, Germany
Arne Kreuz
Software AG, Darmstadt, Germany
Harald Schöning
Software AG