Prompt-Matcher: Leveraging Large Models to Reduce Uncertainty in Schema Matching Results

📅 2024-08-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Schema matching in database integration suffers from high result uncertainty, which lowers processing efficiency and makes decision support unreliable. This paper proposes a framework that combines fine-grained validation by a large language model (LLM) with iterative updates to a probabilistic database: it selects candidate correspondences, validates them with GPT-4, and updates the probabilities of the candidate matching results in a closed loop. The contributions are: (i) embedding prompt engineering into uncertainty modeling; (ii) formulating budget-constrained maximum uncertainty reduction as an NP-hard problem and designing a $(1-1/e)$-approximation algorithm for it; and (iii) constructing two GPT-4 prompt templates for correspondence verification. On two benchmark datasets, the prompt templates reach state-of-the-art verification performance, and the approximation algorithm is significantly more computationally efficient than brute-force search, which makes uncertainty reduction practical under a fixed verification budget.

📝 Abstract

Schema matching is the process of identifying correspondences between the elements of two given schemata, and it is essential for database management systems, data integration, and data warehousing. The optimal schema matching algorithm differs across datasets and scenarios, and even for a single algorithm, hyperparameter tuning produces multiple results. All results are assigned equal probabilities and stored in probabilistic databases to facilitate uncertainty management. This substantial uncertainty diminishes the efficiency and reliability of data processing and prevents decision-makers from receiving more accurate information. To address this problem, we introduce a new approach based on fine-grained correspondence verification using specific prompts for a Large Language Model. Our approach is an iterative loop that consists of three main components: (1) the correspondence selection algorithm, (2) correspondence verification, and (3) the update of the probability distribution. The core idea is that the same correspondences appear in multiple candidate results, so verifying a correspondence reduces the uncertainty over those results. We show that selecting an optimal correspondence set to maximize the expected uncertainty reduction within a fixed budget is an NP-hard problem. We propose a novel $(1-1/e)$-approximation algorithm that significantly outperforms the brute-force algorithm in computational efficiency. To enhance correspondence verification, we have developed two prompt templates that enable GPT-4 to achieve state-of-the-art performance on two established benchmark datasets. Our comprehensive experimental evaluation demonstrates the effectiveness and robustness of the proposed approach.
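The abstract does not spell out how the probability distribution is updated, so the following is only a minimal sketch of one plausible reading: a Bayesian update over candidate results after a single correspondence has been verified. The function names (`entropy`, `update`), the `accuracy` parameter for the verifier, and the toy attribute pairs are all assumptions made for illustration, not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a distribution over candidate results."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def update(probs, results, corr, answer, accuracy=1.0):
    """Bayes update after a verifier says `corr` is correct (True) or not (False).

    `accuracy` is a hypothetical probability that the verifier is right;
    1.0 models a noise-free verifier.
    """
    weights = []
    for p, r in zip(probs, results):
        agrees = (corr in r) == answer
        weights.append(p * (accuracy if agrees else 1.0 - accuracy))
    total = sum(weights)
    if total == 0:
        # The answer contradicts every candidate; leave the distribution as is.
        return list(probs)
    return [w / total for w in weights]

# Toy candidate results: each is a set of (source, target) correspondences.
results = [
    frozenset({("name", "full_name"), ("zip", "postcode")}),
    frozenset({("name", "full_name"), ("zip", "phone")}),
    frozenset({("name", "surname"), ("zip", "postcode")}),
    frozenset({("name", "surname"), ("zip", "phone")}),
]
probs = [0.25] * 4  # equal probabilities, as in the abstract

posterior = update(probs, results, ("zip", "postcode"), True)
print(entropy(probs), entropy(posterior))
```

In this toy case, confirming one correspondence rules out the two candidate results that do not contain it, which halves the number of remaining candidates.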

Problem

Research questions and friction points this paper is trying to address.

Reduces uncertainty in schema matching results using large models.

Optimizes schema matching algorithms across diverse datasets and scenarios.

Enhances data processing efficiency and reliability for decision-making.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained correspondence verification using Large Language Models

Iterative loop with correspondence selection and probability update

Novel $(1-1/e)$-approximation algorithm for computational efficiency
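The summary does not describe how the $(1-1/e)$-approximation algorithm works. A common way to obtain that guarantee is greedy selection on a monotone submodular objective under a cardinality budget, so the sketch below uses that standard technique as a stand-in. It assumes a noise-free verifier and unit cost per verification, in which case the expected entropy reduction of a set of correspondences equals the entropy of the answer patterns it induces. The names `expected_reduction` and `greedy_select` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def expected_reduction(selected, results, probs):
    """Expected entropy reduction (bits) from verifying `selected`.

    With a noise-free verifier this is the entropy of the answer patterns
    that the selected correspondences induce over the candidate results.
    """
    mass = defaultdict(float)
    for r, p in zip(results, probs):
        pattern = tuple(c in r for c in selected)
        mass[pattern] += p
    return -sum(m * math.log2(m) for m in mass.values() if m > 0)

def greedy_select(results, probs, budget):
    """Pick up to `budget` correspondences, each time taking the largest gain."""
    candidates = sorted(set().union(*results))
    chosen = []
    for _ in range(budget):
        base = expected_reduction(chosen, results, probs)
        best, best_gain = None, 1e-12  # ignore zero-gain choices
        for c in candidates:
            if c in chosen:
                continue
            gain = expected_reduction(chosen + [c], results, probs) - base
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            break  # nothing left that reduces uncertainty
        chosen.append(best)
    return chosen

# Toy candidate results with equal probabilities.
results = [
    frozenset({("name", "full_name"), ("zip", "postcode")}),
    frozenset({("name", "full_name"), ("zip", "phone")}),
    frozenset({("name", "surname"), ("zip", "postcode")}),
    frozenset({("name", "surname"), ("zip", "phone")}),
]
probs = [0.25] * 4

chosen = greedy_select(results, probs, budget=3)
print(chosen, expected_reduction(chosen, results, probs))
```

Here the greedy loop stops after two picks even though the budget is three, because a third verification would not separate the candidate results any further. A brute-force alternative would score every subset of correspondences up to the budget, which grows combinatorially with the number of correspondences.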

🔎 Similar Papers

SMUTF: Schema Matching Using Generative Tags and Hybrid Features