Efficient Banzhaf-Based Data Valuation for $k$-Nearest Neighbors Classification

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the prohibitive computational cost of game-theoretic data valuation, specifically the Banzhaf value, for $k$-nearest neighbor ($k$NN) classifiers, which has hindered practical adoption. The paper proves that computing the Banzhaf value in this setting is #P-hard. Despite this hardness result, the authors exploit the locality of $k$NN to build exact algorithms on a dynamic programming framework: a pseudo-polynomial algorithm running in $O(Wkn^2)$ time for weighted $k$NN, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm running in $O(nk^2)$ time for unweighted $k$NN, which is linear in the number of data points $n$. Efficient Monte Carlo estimators are offered as a separate, approximate alternative. Experiments on real-world datasets demonstrate the practical efficiency of the approach and its effectiveness in data valuation applications.
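
For reference, the exponential cost mentioned above follows directly from the standard definition of the Banzhaf value. The notation here is an assumption for illustration and is not taken from the paper: $N$ is the training set with $|N| = n$, $v(S)$ is the utility (for example, $k$NN test performance) of a subset $S$, and $\phi_i$ is the value of point $i$.

$$
\phi_i \;=\; \frac{1}{2^{\,n-1}} \sum_{S \subseteq N \setminus \{i\}} \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]
$$

Evaluating the sum naively requires visiting all $2^{n-1}$ subsets for each point, which is why algorithms that avoid enumeration matter.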

📝 Abstract

Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is #P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.
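
To make the problem concrete, the sketch below computes Banzhaf values for a single test point in two baseline ways: brute-force enumeration of all subsets (exponential in $n$) and a simple Monte Carlo estimator that samples each other point with probability 1/2. This is not the paper's dynamic programming algorithm, which the abstract does not spell out. The utility function (the fraction of the $k$ nearest neighbors in the subset whose label matches the test label, with utility 0 for the empty set) is an assumption chosen for illustration, as are all function and parameter names.

```python
import itertools
import random


def knn_utility(subset, dists, labels, test_label, k):
    """Assumed unweighted kNN utility for one test point: the number of
    correctly labeled points among the k nearest in `subset`, divided by k."""
    if not subset:
        return 0.0
    nearest = sorted(subset, key=lambda i: dists[i])[:k]
    return sum(labels[i] == test_label for i in nearest) / k


def banzhaf_exact(dists, labels, test_label, k):
    """Brute-force Banzhaf values: average marginal gain of point i over
    all 2^(n-1) subsets of the other points. Feasible only for tiny n."""
    n = len(dists)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for combo in itertools.combinations(others, r):
                with_i = knn_utility(combo + (i,), dists, labels, test_label, k)
                without_i = knn_utility(combo, dists, labels, test_label, k)
                total += with_i - without_i
        values.append(total / 2 ** (n - 1))
    return values


def banzhaf_monte_carlo(dists, labels, test_label, k, num_samples, seed=0):
    """Simple Monte Carlo estimate: each other point joins the sampled
    subset independently with probability 1/2, which is uniform over subsets."""
    rng = random.Random(seed)
    n = len(dists)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for _ in range(num_samples):
            subset = [j for j in others if rng.random() < 0.5]
            with_i = knn_utility(subset + [i], dists, labels, test_label, k)
            without_i = knn_utility(subset, dists, labels, test_label, k)
            total += with_i - without_i
        values.append(total / num_samples)
    return values


if __name__ == "__main__":
    # Hypothetical toy data: distances to the test point and training labels.
    dists = [0.1, 0.2, 0.3, 0.4]
    labels = [1, 0, 1, 1]
    print(banzhaf_exact(dists, labels, test_label=1, k=2))
    print(banzhaf_monte_carlo(dists, labels, test_label=1, k=2, num_samples=5000))
```

Under this assumed utility, a point's marginal gain depends only on which points closer to the test point are present in the subset, which is the kind of locality the abstract says the exact algorithms exploit.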

Problem

Research questions and friction points this paper is trying to address.

data valuation

Banzhaf value

k-nearest neighbors

computational complexity

#P-hard

Innovation

Methods, ideas, or system contributions that make the work stand out.

Banzhaf value

k-nearest neighbors

data valuation