🤖 AI Summary
This work addresses the #P-hard complexity of Shapley value computation, which stems from the need to enumerate exponentially many coalitions, and the fact that existing methods overlook models' local dependence on training data. The authors propose the Local Shapley framework, which leverages model-induced locality—such as KNN neighborhoods, tree model leaf nodes, or GNN receptive fields—to define support sets that restrict computation to the critical subsets influencing predictions. Theoretical analysis shows that computational complexity depends on the number of distinct influential subsets and establishes an information-theoretic lower bound on retraining. Building on this, an optimal reuse strategy is devised to achieve lossless acceleration. The framework further yields the LSMR algorithm and its unbiased, exponentially convergent Monte Carlo variant, LSMR-A, which substantially reduce retraining frequency and runtime while preserving high estimation fidelity.
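The support-set idea can be illustrated with a toy radius-neighbors classifier, where locality is exact: only training points within a fixed radius of the test instance can ever enter the prediction, so every other point is a dummy player with Shapley value exactly zero, and exact computation only needs to enumerate subsets of the support. This is a minimal sketch of the projection idea, not the paper's implementation; the function names and the coin-flip baseline for the empty coalition are our own assumptions.

```python
from itertools import combinations
from math import comb

import numpy as np


def utility(coalition_labels, y_test):
    """v(S): 1 if the majority label of the in-radius coalition matches y_test.
    The empty coalition falls back to 0.5 (a coin-flip baseline; an assumption)."""
    if len(coalition_labels) == 0:
        return 0.5
    vals, counts = np.unique(coalition_labels, return_counts=True)
    return float(vals[np.argmax(counts)] == y_test)


def local_shapley(X, y, x_test, y_test, radius=1.0):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    # Support set: with a radius-neighbors rule, only points within `radius`
    # of x_test can ever influence the prediction, so locality is exact and
    # the Shapley game projects losslessly onto the support.
    dist = np.linalg.norm(X - x_test, axis=1)
    support = [i for i in range(n) if dist[i] <= radius]
    m = len(support)
    phi = np.zeros(n)  # points outside the support are dummies: phi stays 0
    for i in support:
        others = [j for j in support if j != i]
        for s in range(m):  # coalition sizes 0 .. m-1
            for S in combinations(others, s):
                marginal = (utility(y[list(S) + [i]], y_test)
                            - utility(y[list(S)], y_test))
                # Shapley weight |S|!(m-1-|S|)!/m! = 1 / (m * C(m-1, |S|))
                phi[i] += marginal / (m * comb(m - 1, s))
    return phi
```

The enumeration cost is 2^|support| rather than 2^n, which is the source of the speedup when supports are small.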
📝 Abstract
The Shapley value provides a principled foundation for data valuation, but exact computation is #P-hard due to the exponential coalition space. Existing accelerations remain global and ignore a structural property of modern predictors: for a given test instance, only a small subset of training points influences the prediction. We formalize this model-induced locality through support sets defined by the model's computational pathway (e.g., neighbors in KNN, leaves in trees, receptive fields in GNNs), showing that Shapley computation can be projected onto these supports without loss when locality is exact. This reframes Shapley evaluation as a structured data processing problem over overlapping support-induced subset families rather than exhaustive coalition enumeration. We prove that the intrinsic complexity of Local Shapley is governed by the number of distinct influential subsets, establishing an information-theoretic lower bound on retraining operations. Guided by this result, we propose LSMR (Local Shapley via Model Reuse), an optimal subset-centric algorithm that trains each influential subset exactly once via support mapping and pivot scheduling. For larger supports, we develop LSMR-A, a reuse-aware Monte Carlo estimator that remains unbiased with exponential concentration, with runtime determined by the number of distinct sampled subsets rather than total draws. Experiments across multiple model families demonstrate substantial retraining reductions and speedups while preserving high valuation fidelity.
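The reuse principle behind LSMR-A can be sketched with a plain permutation-sampling Shapley estimator whose utility evaluations are memoized per distinct support-intersected coalition: since the utility depends only on the coalition's intersection with the support, the expensive retraining cost scales with the number of distinct subsets actually evaluated, not with the number of sampled draws. This is an illustrative sketch of the reuse idea, not the paper's algorithm; the `utility` interface and all names are hypothetical.

```python
import random

import numpy as np


def mc_shapley_with_reuse(n, support, utility, num_perms=200, seed=0):
    """Permutation-sampling Shapley estimator with utility reuse.

    `utility` takes a frozenset of support indices and returns a float
    (hypothetical interface). Utilities are cached per distinct
    support-intersected coalition, so training cost is bounded by the
    number of distinct subsets rather than the number of draws.
    """
    rng = random.Random(seed)
    support = frozenset(support)
    cache = {}

    def v(S):
        key = frozenset(S) & support  # locality: only the support matters
        if key not in cache:
            cache[key] = utility(key)  # the one "retraining" per distinct subset
        return cache[key]

    phi = np.zeros(n)
    for _ in range(num_perms):
        perm = list(range(n))
        rng.shuffle(perm)
        S, prev = set(), v(frozenset())
        for i in perm:
            S.add(i)
            cur = v(S)
            phi[i] += cur - prev  # marginal contribution of i in this order
            prev = cur
    return phi / num_perms, len(cache)
```

Note that points outside the support get marginal contributions of exactly zero (the cache key does not change when they join), matching the exact-locality claim, and the cache can never exceed 2^|support| entries.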