SMUTF: Schema Matching Using Generative Tags and Hybrid Features

📅 2024-01-22

🏛️ arXiv.org

📈 Citations: 7

✨ Influential: 0

🤖 AI Summary

Schema matching performs poorly in open-domain and cross-domain settings, where scarce supervision limits model accuracy. To address this, the paper proposes SMUTF, a lightweight hybrid encoding and classification framework that combines rule-based features, BERT-style semantic embeddings, and structured "generative tags" produced by large language models (LLMs). The authors claim three contributions: (1) a generative tagging paradigm for schema matching, described as the first of its kind, that reduces reliance on manual annotation; (2) HDXSM, a publicly released, human-curated benchmark for humanitarian-domain schema matching, described as the first large-scale benchmark of its kind; (3) state-of-the-art results across multiple public datasets and HDXSM, improving F1 score by 11.84% and ROC AUC by 5.08% over prior approaches.

📝 Abstract

We introduce SMUTF (Schema Matching Using Generative Tags and Hybrid Features), a unique approach for large-scale tabular data schema matching (SM), which assumes that supervised learning does not affect performance in open-domain tasks, thereby enabling effective cross-domain matching. This system uniquely combines rule-based feature engineering, pre-trained language models, and generative large language models. In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy "generative tags" for each data column, enhancing the effectiveness of SM. SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models. Recognizing the lack of extensive, publicly available datasets for SM, we have created and open-sourced the HDXSM dataset from the public humanitarian data. We believe this to be the most exhaustive SM dataset currently available. In evaluations across various public datasets and the novel HDXSM dataset, SMUTF demonstrated exceptional performance, surpassing existing state-of-the-art models in terms of accuracy and efficiency, and improving the F1 score by 11.84% and the AUC of ROC by 5.08%. Code is available at https://github.com/fireindark707/Python-Schema-Matching.
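
The abstract reports gains in F1 score and ROC AUC. As a rough illustration of what those two metrics measure for schema matching, the sketch below scores a handful of candidate column pairs. The labels, scores, and the 0.5 threshold are made-up toy values, not results from the paper, and the helper names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary match labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random true match outscores a random non-match.

    Ties count as half a win. Quadratic in the number of pairs, which is
    fine for a toy example.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one match and one non-match")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = the two columns truly match, 0 = they do not.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]          # hypothetical matcher scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"F1 = {f1:.3f}, ROC AUC = {auc:.3f}")
```

F1 depends on the decision threshold, while ROC AUC summarizes ranking quality across all thresholds, which is one reason papers often report both.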

Problem

Research questions and friction points this paper is trying to address.

Develops SMUTF for open-domain and cross-domain schema matching, where labeled supervision is scarce

Combines rule-based features, pre-trained models, and generative tags

Introduces HDXSM dataset to address lack of public SM datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines rule-based features and language models

Uses generative tags for column matching

Works with any pre-trained embeddings, classification methods, and generative models
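
The hybrid-feature idea in the list above can be sketched roughly as follows. This is not the authors' implementation: the character-trigram vectors stand in for a pre-trained embedding model, the tags are hand-written placeholders in the hashtag-plus-attribute style of the Humanitarian Exchange Language rather than LLM output, and every function name here is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def name_similarity(a, b):
    """Rule-based feature: string similarity of the two column names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_fraction(values):
    """Rule-based feature: share of cell values that parse as numbers."""
    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False
    return sum(is_number(v) for v in values) / len(values) if values else 0.0


def trigram_vector(text):
    """Toy stand-in for a semantic embedding: character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def tag_jaccard(tags_a, tags_b):
    """Overlap between the tag sets assigned to two columns."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def pair_features(col_a, col_b):
    """Concatenate rule-based, embedding-like, and tag features for one pair."""
    return [
        name_similarity(col_a["name"], col_b["name"]),
        abs(numeric_fraction(col_a["values"]) - numeric_fraction(col_b["values"])),
        cosine(trigram_vector(" ".join(col_a["values"])),
               trigram_vector(" ".join(col_b["values"]))),
        tag_jaccard(col_a["tags"], col_b["tags"]),
    ]


# Hypothetical columns; in the paper the tags would come from a generative model.
col_a = {"name": "country_name", "values": ["Kenya", "Mali"], "tags": ["#country", "+name"]}
col_b = {"name": "Country", "values": ["Kenya", "Chad"], "tags": ["#country", "+name"]}

features = pair_features(col_a, col_b)
print(features)
```

A feature vector like this would then be passed to whatever binary classifier is chosen to predict whether the two columns match, which is what makes the components swappable.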

🔎 Similar Papers

Prompt-Matcher: Leveraging Large Models to Reduce Uncertainty in Schema Matching Results

2024-08-24, Citations: 0
