🤖 AI Summary
To address two challenges in automated data preprocessing pipeline construction, the lack of interpretable operator contributions and the combinatorial explosion of the search space, this paper proposes ShapleyPipe, a Shapley-value-based hierarchical search framework. It integrates game-theoretic Shapley values into pipeline optimization and reduces search complexity from exponential to polynomial through a hierarchical decomposition that decouples category-level structure search from operator-level refinement. To model order-dependent operator interactions, it introduces Permutation Shapley values, a position-aware variant, and it uses a multi-armed bandit mechanism to evaluate categories efficiently. Evaluated on 18 benchmark datasets, the method achieves 98.1% of the performance of high-budget baselines while using 24% fewer evaluations, and it outperforms the strongest reinforcement learning baseline by 3.6%. Operator-level Shapley values also correlate strongly with actual performance (Spearman’s ρ = 0.933), which supports their use as interpretable operator valuations.
📝 Abstract
Automated data preparation pipeline construction is critical for machine learning success, yet existing methods suffer from two fundamental limitations: they treat pipeline construction as black-box optimization without quantifying individual operator contributions, and they struggle with the combinatorial explosion of the search space ($N^M$ configurations for $N$ operators and pipeline length $M$). We introduce ShapleyPipe, a principled framework that leverages game-theoretic Shapley values to systematically quantify each operator's marginal contribution while maintaining full interpretability. Our key innovation is a hierarchical decomposition that separates category-level structure search from operator-level refinement, reducing the search complexity from exponential to polynomial. To make Shapley computation tractable, we develop: (1) a Multi-Armed Bandit mechanism for intelligent category evaluation with provable convergence guarantees, and (2) Permutation Shapley values to correctly capture position-dependent operator interactions. Extensive evaluation on 18 diverse datasets demonstrates that ShapleyPipe achieves 98.1% of high-budget baseline performance while using 24% fewer evaluations, and outperforms the state-of-the-art reinforcement learning method by 3.6%. Beyond performance gains, ShapleyPipe provides interpretable operator valuations ($\rho = 0.933$ correlation with empirical performance) that enable data-driven pipeline analysis and systematic operator library refinement.
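To make the notion of an operator's marginal contribution concrete, the following is a minimal sketch of the classic (order-independent) Shapley value, computed exactly over a toy operator set. It is not ShapleyPipe's implementation and does not reproduce the paper's Permutation Shapley values or bandit mechanism. The operator names, the `toy_score` function, and its numbers are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` (frozenset -> float)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when it joins coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_score(ops):
    """Hypothetical downstream accuracy for a set of preprocessing operators."""
    score = 0.70
    if "impute" in ops:
        score += 0.10
    if "scale" in ops:
        score += 0.04
    if "impute" in ops and "scale" in ops:
        score += 0.02  # assumed interaction bonus, split evenly by Shapley
    return score


OPERATORS = ["impute", "scale", "encode"]  # hypothetical operator names
phi = shapley_values(OPERATORS, toy_score)
for name, contribution in phi.items():
    print(name, round(contribution, 3))
```

In this toy setting the values come out to roughly 0.11 for `impute`, 0.05 for `scale`, and 0 for `encode`, and they sum to the gap between the full operator set and the empty one (0.16). The exact computation enumerates every coalition, so its cost grows exponentially with the number of operators, which is the tractability problem the abstract's hierarchical decomposition and bandit-based evaluation are meant to address.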