Laplace Sample Information: Data Informativeness Through a Bayesian Lens

📅 2025-05-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of precisely quantifying the information content of individual samples in deep learning, to enable efficient sample selection, redundancy/noise detection, and dataset analysis. We propose Laplace Sample Information (LSI), a Bayesian-inspired metric that employs the Laplace approximation to construct a Gaussian approximation of the weight posterior and measures each sample’s influence via the KL divergence between the parameter distributions obtained with and without that sample. LSI is the first method to integrate the Laplace approximation with information theory for sample-level assessment, yielding a model-agnostic, supervision-agnostic measure. It supports typicality ranking, label-error detection, class-level information analysis, and dataset difficulty modeling, while exhibiting cross-model transferability and low computational overhead. Empirically, LSI significantly improves mislabeled-sample detection accuracy, faithfully reflects sample typicality, and transfers to large-model training, accelerating convergence on both image and text benchmarks.

📝 Abstract
Accurately estimating the informativeness of individual samples in a dataset is an important objective in deep learning, as it can guide sample selection, which can improve model efficiency and accuracy by removing redundant or potentially harmful samples. We propose the Laplace Sample Information (LSI) measure of sample informativeness, grounded in information theory and widely applicable across model architectures and learning settings. LSI leverages a Bayesian approximation to the weight posterior and the KL divergence to measure the change in the parameter distribution induced by a sample of interest from the dataset. We experimentally show that LSI is effective in ordering the data with respect to typicality, detecting mislabeled samples, measuring class-wise informativeness, and assessing dataset difficulty. We demonstrate these capabilities of LSI on image and text data in supervised and unsupervised settings. Moreover, we show that LSI can be computed efficiently through probes and transfers well to the training of large models.
Problem

Research questions and friction points this paper is trying to address.

Estimating the informativeness of individual samples in deep learning datasets
Proposing Laplace Sample Information (LSI) as a measure of sample informativeness
Applying LSI to detect mislabeled samples and assess dataset difficulty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian approximation for weight posterior
KL divergence measures parameter distribution change
Efficient computation via probes for large models
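The mechanism behind these contributions can be illustrated with the closed-form KL divergence between two Gaussian Laplace posteriors. The sketch below is a minimal, hypothetical illustration (not the authors' implementation): it assumes diagonal-Fisher Laplace approximations, where the posterior over weights is `N(theta_MAP, H^-1)` with `H` approximated by a diagonal Fisher, and scores a sample as the KL divergence between the posterior fitted with the full dataset and the posterior fitted without that sample. The function names and the toy numbers are invented for illustration.

```python
import numpy as np

def gaussian_kl_diag(mu_p, var_p, mu_q, var_q):
    """Closed-form KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) )."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def laplace_sample_information(theta_full, fisher_full, theta_loo, fisher_loo):
    """Hypothetical LSI sketch: KL divergence between diagonal Laplace
    posteriors fitted with and without the sample of interest.

    Under the Laplace approximation the posterior is N(theta_MAP, H^-1);
    with a diagonal Fisher approximation of H, each variance is 1/fisher.
    """
    var_full = 1.0 / fisher_full
    var_loo = 1.0 / fisher_loo
    return gaussian_kl_diag(theta_loo, var_loo, theta_full, var_full)

# Toy example: a 3-parameter model where removing the sample shifts the
# MAP estimate slightly and rescales the Fisher information.
theta = np.array([0.5, -1.2, 2.0])
fisher = np.array([4.0, 9.0, 1.0])
lsi = laplace_sample_information(theta, fisher, theta + 0.1, fisher * 1.1)
print(lsi)  # a small positive score: the sample barely moves the posterior
```

A sample whose removal leaves the posterior unchanged scores exactly zero; the more a sample shifts the weight distribution, the larger its LSI, which is what makes the score usable for typicality ranking and mislabel detection.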
Johannes Kaiser
AI in Healthcare and Medicine; Munich Center for Machine Learning (MCML), Technical University of Munich, Germany
Kristian Schwethelm
Technical University of Munich
Large Language Models; Trustworthy AI; Privacy-Preserving ML
Daniel Rueckert
Technical University of Munich and Imperial College London
Machine Learning; Medical Image Computing; Biomedical Image Analysis; Computer Vision
Georgios Kaissis
AI in Healthcare and Medicine; Munich Center for Machine Learning (MCML), Technical University of Munich, Germany