SMUTF: Schema Matching Using Generative Tags and Hybrid Features

📅 2024-01-22
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
📄 PDF
🤖 AI Summary
To address the performance limitations of schema matching in open-domain and cross-domain settings—where scarce supervision severely hampers model accuracy—this paper proposes a Generative Tags mechanism. It integrates rule-based features, BERT-style semantic embeddings, and structured tags generated by large language models (LLMs), enabling a lightweight hybrid encoding and classification framework. Our key contributions are threefold: (1) We introduce the first generative tagging paradigm for schema matching, substantially reducing reliance on manual annotations; (2) We construct and publicly release HDXSM—the first large-scale, human-curated benchmark specifically designed for humanitarian-domain schema matching; (3) Our method achieves state-of-the-art performance, improving F1 score by 11.84% and ROC AUC by 5.08% over prior approaches across multiple public datasets and HDXSM.

Technology Category

Application Category

📝 Abstract
We introduce SMUTF (Schema Matching Using Generative Tags and Hybrid Features), a unique approach for large-scale tabular data schema matching (SM), which assumes that supervised learning does not affect performance in open-domain tasks, thereby enabling effective cross-domain matching. This system uniquely combines rule-based feature engineering, pre-trained language models, and generative large language models. In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy"generative tags"for each data column, enhancing the effectiveness of SM. SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models. Recognizing the lack of extensive, publicly available datasets for SM, we have created and open-sourced the HDXSM dataset from the public humanitarian data. We believe this to be the most exhaustive SM dataset currently available. In evaluations across various public datasets and the novel HDXSM dataset, SMUTF demonstrated exceptional performance, surpassing existing state-of-the-art models in terms of accuracy and efficiency, and improving the F1 score by 11.84% and the AUC of ROC by 5.08%. Code is available at https://github.com/fireindark707/Python-Schema-Matching.
Problem

Research questions and friction points this paper is trying to address.

Develops SMUTF for cross-domain schema matching without supervised learning
Combines rule-based features, pre-trained models, and generative tags
Introduces HDXSM dataset to address lack of public SM datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines rule-based features and language models
Uses generative tags for column matching
Works with any pre-trained embeddings and models
🔎 Similar Papers
No similar papers found.