🤖 AI Summary
To address the performance limitations of schema matching in open-domain and cross-domain settings—where scarce supervision severely hampers model accuracy—this paper proposes a Generative Tags mechanism. It integrates rule-based features, BERT-style semantic embeddings, and structured tags generated by large language models (LLMs), enabling a lightweight hybrid encoding and classification framework. Our key contributions are threefold: (1) We introduce the first generative tagging paradigm for schema matching, substantially reducing reliance on manual annotations; (2) We construct and publicly release HDXSM—the first large-scale, human-curated benchmark specifically designed for humanitarian-domain schema matching; (3) Our method achieves state-of-the-art performance, improving F1 score by 11.84% and ROC AUC by 5.08% over prior approaches across multiple public datasets and HDXSM.
📝 Abstract
We introduce SMUTF (Schema Matching Using Generative Tags and Hybrid Features), a unique approach for large-scale tabular data schema matching (SM), which assumes that supervised learning does not affect performance in open-domain tasks, thereby enabling effective cross-domain matching. This system uniquely combines rule-based feature engineering, pre-trained language models, and generative large language models. In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy"generative tags"for each data column, enhancing the effectiveness of SM. SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models. Recognizing the lack of extensive, publicly available datasets for SM, we have created and open-sourced the HDXSM dataset from the public humanitarian data. We believe this to be the most exhaustive SM dataset currently available. In evaluations across various public datasets and the novel HDXSM dataset, SMUTF demonstrated exceptional performance, surpassing existing state-of-the-art models in terms of accuracy and efficiency, and improving the F1 score by 11.84% and the AUC of ROC by 5.08%. Code is available at https://github.com/fireindark707/Python-Schema-Matching.