KAIROS: Scalable Model-Agnostic Data Valuation

📅 2025-06-30
🤖 AI Summary
Existing data valuation methods have three main limitations: model-based approaches inherit the biases of the single fitted model they depend on; algorithm-based methods such as Data Shapley require repeated retraining, which is prohibitively expensive at scale; and Wasserstein-based model-agnostic methods rely on approximations that misrank examples relative to their true leave-one-out (LOO) utility. This paper introduces KAIROS, a scalable, model-agnostic data valuation framework that scores each example by its contribution to the Maximum Mean Discrepancy (MMD) between the empirical training distribution and a clean reference set. The MMD-based influence has a closed form that approximates the exact LOO ranking within O(1/N^2) error and requires no retraining, and it extends to conditional kernels for unified label- and feature-error detection. When a new batch of size m arrives, all scores can be updated in O(mN) time, giving up to a 50× speedup without compromising ranking quality. On noise, mislabeling, and poisoning benchmarks, KAIROS outperforms model-, Shapley-, and Wasserstein-based baselines in both accuracy and runtime. The paper also provides theoretical guarantees, including symmetry for reproducible rankings and density separation for interpretable thresholds.

📝 Abstract
Training data increasingly shapes not only model accuracy but also regulatory compliance and market valuation of AI assets. Yet existing valuation methods remain inadequate: model-based techniques depend on a single fitted model and inherit its biases, while algorithm-based approaches such as Data Shapley require costly retrainings at web scale. Recent Wasserstein-based model-agnostic methods rely on approximations that misrank examples relative to their true leave-one-out (LOO) utility. We introduce KAIROS, a scalable, model-agnostic valuation framework that assigns each example a distributional influence score: its contribution to the Maximum Mean Discrepancy (MMD) between the empirical training distribution and a clean reference set. Unlike Wasserstein surrogates, our MMD-based influence admits a closed-form solution that faithfully approximates the exact LOO ranking within $O(1/N^2)$ error, requires no retraining, and naturally extends to conditional kernels for unified label- and feature-error detection. Moreover, KAIROS supports efficient online updates: when a new batch of size m arrives, all scores can be updated in $O(mN)$ time, delivering up to 50x speedup without compromising ranking quality. Empirical evaluations on noise, mislabeling, and poisoning benchmarks show that KAIROS consistently outperforms state-of-the-art model-, Shapley-, and Wasserstein-based baselines in both accuracy and runtime. We provide rigorous theoretical guarantees, including symmetry for reproducible rankings and density-separation for interpretable thresholds.
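To make the idea of an MMD-based influence score concrete, here is a minimal sketch, not the paper's estimator. It assumes a Gaussian kernel, the biased (V-statistic) MMD estimate, and a score equal to each point's mean kernel similarity to the training set minus its mean similarity to the reference set. The function names, the toy data, and the `gamma` value are all hypothetical. The sketch compares that closed-form score against a brute-force LOO computation of the drop in squared MMD.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(k_xx, k_xy, k_yy):
    """Biased (V-statistic) estimate of squared MMD from kernel matrices."""
    return k_xx.mean() - 2.0 * k_xy.mean() + k_yy.mean()


def closed_form_scores(k_xx, k_xy):
    """Assumed score: mean similarity to train minus mean similarity to reference."""
    return k_xx.mean(axis=1) - k_xy.mean(axis=1)


def brute_force_loo(k_xx, k_xy, k_yy):
    """Exact drop in squared MMD when each training point is removed in turn."""
    n = k_xx.shape[0]
    full = mmd2(k_xx, k_xy, k_yy)
    drops = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        drops[i] = full - mmd2(k_xx[np.ix_(keep, keep)], k_xy[keep], k_yy)
    return drops


# Toy data (hypothetical): 270 clean points plus a cluster of 30 shifted points.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(270, 2))
shifted = rng.normal(6.0, 0.3, size=(30, 2))
train = np.vstack([clean, shifted])
reference = rng.normal(0.0, 1.0, size=(300, 2))

k_xx = rbf_kernel(train, train)
k_xy = rbf_kernel(train, reference)
k_yy = rbf_kernel(reference, reference)

scores = closed_form_scores(k_xx, k_xy)   # no retraining, one pass over kernel rows
loo = brute_force_loo(k_xx, k_xy, k_yy)   # N separate MMD evaluations

print("correlation with brute-force LOO:", np.corrcoef(scores, loo)[0, 1])
print("mean score, clean vs shifted:", scores[:270].mean(), scores[270:].mean())
```

In this setup a higher score means the point pulls the training distribution away from the reference, so the shifted cluster should score higher on average than the clean points. For the biased estimator used here, the brute-force LOO drop differs from an affine function of the score only by a term of order 1/(N-1)^2, which is consistent with the error rate quoted in the abstract.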
Problem

Research questions and friction points this paper is trying to address.

Scalable model-agnostic data valuation framework
Accurate leave-one-out utility approximation
Efficient online updates for large datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-agnostic MMD-based influence scoring
Closed-form solution that approximates the exact LOO ranking within O(1/N^2) error
Efficient online updates with O(mN) complexity
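
One way an O(mN) batch update could work is sketched below. This is an assumption for illustration, since the page does not describe the paper's actual update procedure. The sketch keeps, for every training point, its running kernel row sums against the training set and against the reference set, so a new batch of size m only requires kernel evaluations between the batch and the existing points. The `OnlineScores` class, its methods, and the Gaussian kernel with `gamma=0.5` are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


class OnlineScores:
    """Keeps per-example kernel row sums so scores can be refreshed per batch."""

    def __init__(self, reference, gamma=0.5):
        self.reference = reference
        self.gamma = gamma
        self.train = np.empty((0, reference.shape[1]))
        self.sum_train = np.empty(0)  # sum_j k(x_i, x_j) over the current training set
        self.sum_ref = np.empty(0)    # sum_j k(x_i, y_j) over the reference set

    def add_batch(self, batch):
        k_old_new = rbf_kernel(self.train, batch, self.gamma)       # N x m
        k_new_new = rbf_kernel(batch, batch, self.gamma)            # m x m
        k_new_ref = rbf_kernel(batch, self.reference, self.gamma)   # m x M
        self.sum_train = np.concatenate([
            self.sum_train + k_old_new.sum(axis=1),          # old points gain m terms
            k_old_new.sum(axis=0) + k_new_new.sum(axis=1),   # new points: full row sums
        ])
        self.sum_ref = np.concatenate([self.sum_ref, k_new_ref.sum(axis=1)])
        self.train = np.vstack([self.train, batch])

    def scores(self):
        """Mean similarity to train minus mean similarity to reference."""
        return self.sum_train / len(self.train) - self.sum_ref / len(self.reference)


# Toy stream (hypothetical): four batches of 20 points, reference set of 50 points.
rng = np.random.default_rng(1)
reference = rng.normal(size=(50, 3))
batches = [rng.normal(size=(20, 3)) for _ in range(4)]

stream = OnlineScores(reference)
for batch in batches:
    stream.add_batch(batch)
online = stream.scores()

# Recompute from scratch to check that the incremental bookkeeping agrees.
all_train = np.vstack(batches)
from_scratch = (rbf_kernel(all_train, all_train).mean(axis=1)
                - rbf_kernel(all_train, reference).mean(axis=1))
print("matches full recomputation:", np.allclose(online, from_scratch))
```

Each call to `add_batch` evaluates the kernel on N×m, m×m, and m×M pairs (M is the reference size), so the per-batch cost is dominated by the N×m term when m and M are small relative to N, whereas a full recomputation would touch all (N+m)^2 training pairs.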