🤖 AI Summary
Existing methods struggle to align and interpret distribution shifts between datasets drawn from the same domain but differing in modality, such as tabular, textual, visual, and time-series data, especially when disparities in scale or dimensionality are pronounced, and they offer little interpretability. This paper introduces the first human-centric framework for explaining cross-modal distribution discrepancies, implemented as an interpretable dataset-comparison toolbox. It integrates statistical hypothesis testing, feature-importance decomposition, class activation mapping (CAM), contrastive representation learning, and interpretable generative modeling to enable fine-grained, semantically readable attribution and visualization of distributional shifts. Evaluated across diverse real-world scenarios, the framework significantly improves users' efficiency in understanding the causes of shifts and the accuracy of their intervention decisions, overcoming the limitations of conventional black-box shift-detection approaches.
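As a concrete illustration of the statistical-hypothesis-testing component for tabular data, the sketch below runs a per-feature two-sample Kolmogorov–Smirnov test between two datasets and ranks features by how strongly their distributions differ. This is a minimal example of the general technique, not the toolbox's actual API; the function name `explain_tabular_shift`, the Bonferroni threshold, and the example columns are assumptions.

```python
# Minimal sketch (not the paper's toolbox API): per-feature two-sample
# Kolmogorov-Smirnov tests to flag which tabular features drive a shift.
# Function name, column handling, and the alpha threshold are assumptions.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def explain_tabular_shift(df_a: pd.DataFrame, df_b: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Compare shared numeric columns of two datasets and rank them by shift severity."""
    shared = [c for c in df_a.select_dtypes("number").columns if c in df_b.columns]
    rows = []
    for col in shared:
        stat, p = ks_2samp(df_a[col].dropna(), df_b[col].dropna())
        rows.append({"feature": col, "ks_statistic": stat, "p_value": p})
    result = pd.DataFrame(rows).sort_values("ks_statistic", ascending=False)
    # Bonferroni correction keeps the family-wise error rate at alpha.
    result["shifted"] = result["p_value"] < alpha / max(len(shared), 1)
    return result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = pd.DataFrame({"age": rng.normal(40, 10, 1000), "income": rng.lognormal(10, 1, 1000)})
    deploy = pd.DataFrame({"age": rng.normal(48, 10, 1000), "income": rng.lognormal(10, 1, 1000)})
    print(explain_tabular_shift(train, deploy))  # 'age' should surface as the shifted feature
```

The output is a small table a practitioner can read directly: which features shifted, by how much, and whether the difference survives multiple-testing correction.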
📝 Abstract
The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two datasets from the same domain may exhibit differing distributions. While many techniques exist for detecting such distribution shifts, there is a lack of comprehensive methods to explain these differences in a human-understandable way beyond opaque quantitative metrics. To bridge this gap, we propose a versatile toolbox of interpretable methods for comparing datasets. Using a variety of case studies, we demonstrate the effectiveness of our approach across diverse data modalities -- including tabular data, text data, images, and time series signals -- in both low- and high-dimensional settings. These methods complement existing techniques by providing actionable and interpretable insights to better understand and address distribution shifts.
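One widely used interpretable way to compare two datasets with the same schema, in the spirit of the abstract, is a "domain classifier" (classifier two-sample test): train a model to distinguish the datasets, read its cross-validated AUC as a shift score, and read its feature importances as an explanation of where the distributions diverge. The sketch below is a hedged example of that general strategy, not the authors' implementation; the helper name `domain_classifier_explanation` and the random-forest choice are illustrative assumptions.

```python
# Hedged illustration of a classifier two-sample test with interpretable output.
# Assumes df_a and df_b share the same numeric columns; names are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def domain_classifier_explanation(df_a: pd.DataFrame, df_b: pd.DataFrame):
    """Return a shift score (AUC) plus per-feature importances explaining the shift."""
    X = pd.concat([df_a, df_b], ignore_index=True)
    y = np.concatenate([np.zeros(len(df_a)), np.ones(len(df_b))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    # AUC near 0.5: the datasets are indistinguishable; near 1.0: a strong shift.
    scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, scores)
    clf.fit(X, y)
    importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
    return auc, importances
```

The appeal of this style of comparison is the same as the paper's stated goal: instead of a single opaque divergence number, the user gets both a magnitude (how separable the datasets are) and a ranked, human-readable account of which features are responsible.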