🤖 AI Summary
This work addresses joint schema-value matching between pivot tables and relational tables in data lakes, a task that demands semantic consistency at the schema level, compatibility at the value level, and generalization under anonymized data. To this end, we propose PiLLar, a novel framework that, for the first time, formulates the task as a large language model (LLM)-guided Monte Carlo Tree Search (MCTS), enabling unsupervised, training-free cross-domain adaptation. We provide a theoretical analysis of the framework's error dynamics that guarantees asymptotic convergence. Furthermore, we construct PTbench, the first real-world benchmark for this problem. Experimental results show that PiLLar achieves an average matching accuracy of 87.94% on PTbench, significantly outperforming existing methods and demonstrating strong generalization capability.
📝 Abstract
Pivot tables are ubiquitous in the data lakes of modern data ecosystems, making accurate schema matching over pivot tables a key prerequisite for data integration. In this paper, we focus on pivot table schema matching, a novel joint schema-value matching task. It aims to align schemas between pivot tables and standard relational tables, where a correct match must be semantically consistent at the schema level and compatible at the value level. However, because the data involved are inherently sensitive, anonymized data are prevalent in practice, which poses significant challenges to matching accuracy and generalization capability. To tackle these challenges, we propose PiLLar, the first framework for pivot table schema matching. We first formulate PiLLar as an LLM-driven search paradigm that operates with minimal annotated, privacy-compliant data, thereby achieving training-free adaptation across diverse domains. Next, we provide a theoretical analysis of the paradigm's error dynamics to establish the asymptotic convergence of the proposed method. Furthermore, we introduce a new benchmark, PTbench, derived from four representative real-world domains and constructed by mining unpivot-suitable tables, performing unpivot on semantically coherent attributes, and applying sampling and anonymization. Extensive experiments demonstrate the superiority of PiLLar, which achieves an average accuracy of 87.94% on the correctly predicted matches.
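To make the pivot/relational distinction concrete, the following minimal sketch shows a generic unpivot operation in plain Python. It is an illustration only, not PiLLar's algorithm or the PTbench construction pipeline; the `unpivot` helper, the column names, and the toy data are all assumptions introduced here.

```python
from typing import Any, Dict, List


def unpivot(
    rows: List[Dict[str, Any]],
    id_cols: List[str],
    var_name: str,
    value_name: str,
) -> List[Dict[str, Any]]:
    """Turn every non-id column of a wide (pivot) table into its own row."""
    long_rows = []
    for row in rows:
        for col, cell in row.items():
            if col in id_cols:
                continue
            out = {k: row[k] for k in id_cols}  # carry identifier columns over
            out[var_name] = col                 # former header becomes a value
            out[value_name] = cell              # former cell becomes a value
            long_rows.append(out)
    return long_rows


# Hypothetical pivot table: the headers "2021" and "2022" are really values.
pivot = [
    {"region": "North", "2021": 120, "2022": 135},
    {"region": "South", "2021": 90, "2022": 110},
]

relational = unpivot(pivot, id_cols=["region"], var_name="year", value_name="sales")
for record in relational:
    print(record)
```

In this toy case, the pivot headers `2021` and `2022` correspond to values of the relational attribute `year`, and the pivot cells correspond to values of `sales`. This is why a match between the two forms has to agree at both the schema level and the value level.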