Proper Dataset Valuation by Pointwise Mutual Information

📅 2024-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Test-set overfitting in data-curation evaluation distorts quality assessment. Method: We propose an information-theoretic framework for data valuation that uses Shannon mutual information to quantify dataset value; specifically, we define a parameter-level informativeness metric measuring how well the data characterize the true model parameters. Grounded in the Blackwell order, we design a mutual information estimator that leverages Bayesian posteriors over data embeddings, circumventing the Goodhart's-law risk inherent in conventional test-accuracy-based evaluation. Contribution/Results: Experiments show that the method identifies and penalizes spurious optimization strategies (e.g., excessive cleaning that destroys information), whereas standard test-score evaluation erroneously rewards overfitted solutions. On real-world datasets it exhibits superior robustness and discriminative power in distinguishing high-value from low-value data.

📝 Abstract
Data plays a central role in the development of modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of various data curation methods in recent years. However, measuring the effectiveness of these data curation techniques remains a major challenge. Traditional evaluation methods, which assess a trained model's performance on specific benchmarks, risk promoting practices that merely make the data more similar to the test data. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this, we propose an information-theoretic framework for evaluating data curation methods, where dataset quality is measured by its informativeness about the true model parameters using the Blackwell ordering. We compare informativeness by the Shannon mutual information of the evaluated data and the test data, and we propose a novel method for estimating the mutual information of datasets by training Bayesian models on embedded data and computing the mutual information from the model's parameter posteriors. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, while traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.
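The estimator described in the abstract trains Bayesian models on embedded data and reads mutual information off the parameter posterior. As an illustrative stand-in (not the authors' implementation), a Bayesian linear model with Gaussian prior and noise makes this computation closed-form, since both the prior and posterior over parameters are Gaussian; the function names and the pruning scenario below are hypothetical:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian: 0.5 * log det(2*pi*e*cov)."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def dataset_mutual_information(X, noise_var=1.0, prior_var=1.0):
    """I(theta; D) = H(theta) - H(theta | D) under Bayesian linear regression.

    With a Gaussian prior N(0, prior_var * I) and Gaussian noise, the posterior
    covariance depends only on the embedded inputs X, so no expectation over
    labels is needed."""
    d = X.shape[1]
    prior_cov = prior_var * np.eye(d)
    post_cov = np.linalg.inv(X.T @ X / noise_var + np.eye(d) / prior_var)
    return gaussian_entropy(prior_cov) - gaussian_entropy(post_cov)

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 8))   # embedded "raw" dataset (hypothetical)
X_pruned = X_full[:20]               # over-aggressive curation keeps only 10%

mi_full = dataset_mutual_information(X_full)
mi_pruned = dataset_mutual_information(X_pruned)
```

In this toy setting the heavily pruned dataset scores strictly lower, mirroring the paper's claim that informativeness-based evaluation penalizes curation that discards information about the parameters, even if test accuracy happened to survive the pruning.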
Problem

Research questions and friction points this paper is trying to address.

Evaluate data curation methods effectively
Measure dataset quality using information theory
Prevent overfitting to test data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-theoretic framework for data evaluation
Blackwell ordering measures dataset informativeness
Bayesian models estimate mutual information