Rethinking Data Value: Asymmetric Data Shapley for Structure-Aware Valuation in Data Markets and Machine Learning Pipelines

📅 2025-11-16
🤖 AI Summary
Problem: Classical Data Shapley (DS) assumes that data points are interchangeable (symmetry), so it cannot capture directional and temporal dependencies, such as redundancy, augmentation-induced dependence, and stage-specific contributions, that arise in modern data markets and multi-stage AI pipelines (e.g., federated learning, staged fine-tuning of large language models). Method: We propose Asymmetric Data Shapley (ADS), a structure-aware valuation framework that relaxes the symmetry assumption by averaging marginal contributions only over permutations that respect an application-specific ordering of data groups. ADS preserves efficiency and linearity and reduces to classical DS when all data fall into a single group. We further develop two practical estimators: a Monte Carlo estimator (MC-ADS) for general models and a k-nearest-neighbor surrogate (KNN-ADS) that is exact and efficient for KNN predictors. Results: Experiments indicate that ADS improves valuation accuracy and interpretability, distinguishing novel from redundant data contributions, and that it consistently outperforms baseline methods in directional and temporally sensitive scenarios.
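
One way to write the permutation-restricted average described in this summary is sketched below. The notation is assumed for illustration and is not taken from the paper: U is a utility function on subsets of the dataset D, \Pi_\sigma is the set of permutations of D consistent with the chosen group ordering \sigma, and P_i^\pi is the set of points that precede point i in permutation \pi.

```latex
\phi_i^{\mathrm{ADS}}(U)
  \;=\; \frac{1}{|\Pi_\sigma|}
  \sum_{\pi \in \Pi_\sigma}
  \bigl[\, U\bigl(P_i^\pi \cup \{i\}\bigr) - U\bigl(P_i^\pi\bigr) \,\bigr]
```

Under this form, efficiency follows because the marginal contributions along any single permutation telescope to U(D) - U(\emptyset), so the values sum to that same quantity. When the ordering has a single group, \Pi_\sigma contains all |D|! permutations and the expression is the classical Data Shapley value.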

📝 Abstract
Rigorous valuation of individual data sources is critical for fair compensation in data markets, informed data acquisition, and transparent development of ML/AI models. Classical Data Shapley (DS) provides an essential axiomatic framework for data valuation but is constrained by its symmetry axiom, which assumes interchangeability of data sources. This assumption fails to capture the directional and temporal dependencies prevalent in modern ML/AI workflows, including the reliance of duplicated or augmented data on original sources and the order-specific contributions in sequential pipelines such as federated learning and multi-stage LLM fine-tuning. To address these limitations, we introduce Asymmetric Data Shapley (ADS), a structure-aware data valuation framework for modern ML/AI pipelines. ADS relaxes symmetry by averaging marginal contributions only over permutations consistent with an application-specific ordering of data groups. It preserves efficiency and linearity, maintains within-group symmetry and directional precedence across groups, and reduces to DS when the ordering collapses to a single group. We develop two complementary computational procedures for ADS: (i) a Monte Carlo estimator (MC-ADS) with finite-sample accuracy guarantees, and (ii) a k-nearest-neighbor surrogate (KNN-ADS) that is exact and efficient for KNN predictors. Across representative settings with directional and temporal dependence, ADS consistently outperforms benchmark methods by distinguishing novel from redundant contributions and respecting the sequential nature of training. These results establish ADS as a principled and practical approach to equitable data valuation in data markets and complex ML/AI pipelines.
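
The averaging over ordering-consistent permutations described in the abstract can be illustrated with a minimal Monte Carlo sketch. This is not the paper's MC-ADS implementation: the function name `mc_ads`, the toy `coverage` utility, and the point names are hypothetical, and the sampler simply shuffles points within each group while keeping the groups in their given order.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def mc_ads(
    groups: List[List[str]],
    utility: Callable[[FrozenSet[str]], float],
    n_perms: int = 2000,
    seed: int = 0,
) -> Dict[str, float]:
    """Monte Carlo estimate of permutation-restricted Shapley-style values.

    groups: ordered groups of point ids; every point in an earlier group
            precedes every point in a later group (assumed constraint form).
    utility: maps a frozenset of point ids to a performance score.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for g in groups for p in g}
    for _ in range(n_perms):
        prefix = set()
        prev = utility(frozenset(prefix))
        for group in groups:
            order = list(group)
            rng.shuffle(order)  # within-group order is uniformly random
            for p in order:
                prefix.add(p)
                cur = utility(frozenset(prefix))
                totals[p] += cur - prev  # marginal contribution of p
                prev = cur
    return {p: t / n_perms for p, t in totals.items()}


# Toy utility (hypothetical): number of distinct contents covered.
# "a_copy" duplicates "a", so it adds nothing once "a" is present.
content = {"a": "x", "a_copy": "x", "b": "y"}


def coverage(subset: FrozenSet[str]) -> float:
    return float(len({content[p] for p in subset}))


# Originals form the first group; the duplicate comes in a later group.
ads = mc_ads([["a", "b"], ["a_copy"]], coverage)

# A single group recovers the symmetric (classical) setting.
ds = mc_ads([["a", "b", "a_copy"]], coverage)
```

In this toy setting the ordered run credits the original `a` with the full contribution and the duplicate `a_copy` with none, while the single-group run splits the credit roughly evenly between them. In both runs the values sum to the utility of the full dataset minus that of the empty set, which is the efficiency property mentioned above.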
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of symmetric data valuation in modern ML workflows
Captures directional and temporal dependencies in sequential AI pipelines
Provides equitable valuation for data markets and complex ML systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Data Shapley relaxes the symmetry axiom
Monte Carlo estimator with finite-sample accuracy guarantees
KNN surrogate that is exact and efficient for KNN predictors