Proper Dataset Valuation by Pointwise Mutual Information

📅 2024-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Test-set overfitting in data-curation evaluation distorts quality assessment. Method: We propose an information-theoretic framework for data valuation that uses Shannon mutual information to quantify dataset value; specifically, we define a parameter-level informativeness metric measuring how well the data characterize the true model parameters. Grounded in the Blackwell order, we design a mutual information estimator that leverages Bayesian posteriors over data embeddings, circumventing the Goodhart's-law risk inherent in conventional test-accuracy-based evaluation. Contribution/Results: Experiments show that the method identifies and penalizes spurious optimization strategies (e.g., excessive cleaning that destroys information), whereas standard test-score evaluation erroneously rewards overfitted solutions. On real-world datasets it exhibits superior robustness and discriminative power in distinguishing high-value from low-value data.

📝 Abstract
Data plays a central role in the development of modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of various data curation methods in recent years. However, measuring the effectiveness of these data curation techniques remains a major challenge. Traditional evaluation methods, which assess a trained model's performance on specific benchmarks, risk promoting practices that merely make the data more similar to the test data. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this, we propose an information-theoretic framework for evaluating data curation methods, where dataset quality is measured by its informativeness about the true model parameters using the Blackwell ordering. We compare informativeness by the Shannon mutual information of the evaluated data and the test data, and we propose a novel method for estimating the mutual information of datasets by training Bayesian models on embedded data and computing the mutual information from the model's parameter posteriors. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, while traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.
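The estimator described in the abstract trains Bayesian models on embedded data and reads mutual information off the parameter posterior. As an illustrative stand-in (not the authors' implementation), a Bayesian linear model with Gaussian prior and noise makes this computation closed-form, since both the prior and posterior over parameters are Gaussian; the function names and the pruning scenario below are hypothetical:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian: 0.5 * log det(2*pi*e*cov)."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def dataset_mutual_information(X, noise_var=1.0, prior_var=1.0):
    """I(theta; D) = H(theta) - H(theta | D) under Bayesian linear regression.

    With a Gaussian prior N(0, prior_var * I) and Gaussian noise, the posterior
    covariance depends only on the embedded inputs X, so no expectation over
    labels is needed."""
    d = X.shape[1]
    prior_cov = prior_var * np.eye(d)
    post_cov = np.linalg.inv(X.T @ X / noise_var + np.eye(d) / prior_var)
    return gaussian_entropy(prior_cov) - gaussian_entropy(post_cov)

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 8))   # embedded "raw" dataset (hypothetical)
X_pruned = X_full[:20]               # over-aggressive curation keeps only 10%

mi_full = dataset_mutual_information(X_full)
mi_pruned = dataset_mutual_information(X_pruned)
```

In this toy setting the heavily pruned dataset scores strictly lower, mirroring the paper's claim that informativeness-based evaluation penalizes curation that discards information about the parameters, even if test accuracy happened to survive the pruning.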
Problem

Research questions and friction points this paper is trying to address.

Evaluate data curation methods effectively
Measure dataset quality using information theory
Prevent overfitting to test data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-theoretic framework for data evaluation
Blackwell ordering measures dataset informativeness
Bayesian models estimate mutual information