🤖 AI Summary
Multi-source data integration faces two sets of challenges: manual schema mapping scales poorly and is costly to maintain, while LLM-based approaches suffer from inconsistent outputs, limited expressiveness (e.g., inability to support GLaV), and excessive invocation overhead. To address these, this paper proposes a lightweight and robust LLM-augmented schema mapping framework. Methodologically: (i) sampling-based majority voting mitigates LLM output volatility; (ii) language-enhanced prompting natively supports highly expressive mappings such as GLaV; and (iii) structured, metadata-driven pre-filtering sharply reduces redundant LLM invocations. Experiments across multiple benchmarks demonstrate substantial improvements in mapping accuracy and robustness, with over 40% fewer LLM calls and efficient mapping generation for schemas containing up to hundreds of attributes. The framework establishes a new paradigm for scalable, low-maintenance data integration.
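The sampling-based majority voting idea can be illustrated with a minimal sketch. The paper does not publish its implementation, so the candidate format and the `majority_vote` helper below are assumptions: each sampled LLM call is treated as returning a set of (source, target) attribute correspondences, and the most frequent candidate wins.

```python
from collections import Counter

def majority_vote(candidates):
    """Aggregate sampled mapping candidates by frequency.

    Each candidate is a tuple of (source_attr, target_attr) pairs,
    so identical proposals can be counted directly and the most
    common one is returned as the consensus mapping.
    """
    counts = Counter(candidates)
    best, _ = counts.most_common(1)[0]
    return best

# Hypothetical outputs from k=5 sampled LLM calls for one attribute;
# the attribute names are illustrative, not from the paper.
samples = [
    (("cust_name", "customer.name"),),
    (("cust_name", "customer.name"),),
    (("cust_name", "customer.full_name"),),
    (("cust_name", "customer.name"),),
    (("cust_name", "customer.full_name"),),
]
consensus = majority_vote(samples)  # the mapping proposed 3 of 5 times
```

Sampling the same prompt several times and keeping only the modal answer is a generic way to damp stochastic variation in LLM output; the paper's actual aggregation may be more elaborate.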
📝 Abstract
The growing need to integrate information from a large number of diverse sources poses significant scalability challenges for data integration systems. These systems often rely on manually written schema mappings, which are complex, source-specific, and costly to maintain as sources evolve. While recent advances suggest that large language models (LLMs) can assist in automating schema matching by leveraging both structural and natural language cues, key challenges remain. In this paper, we identify three core issues with using LLMs for schema mapping: (1) inconsistent outputs due to sensitivity to input phrasing and structure, which we address through sampling and aggregation techniques; (2) the need for more expressive mappings (e.g., GLaV), which strain the limited context windows of LLMs; and (3) the computational cost of repeated LLM calls, which we mitigate through strategies such as data type prefiltering.
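The data type prefiltering strategy can be sketched as follows. This is a hedged illustration, not the paper's implementation: the `compatible` type groups and the example schemas are invented for clarity, and a real system would likely use richer structured metadata (lengths, constraints, value distributions).

```python
def compatible(src_type, tgt_type):
    # Hypothetical coarse type-compatibility groups.
    groups = {
        "int": "numeric", "float": "numeric", "decimal": "numeric",
        "varchar": "text", "text": "text", "char": "text",
        "date": "temporal", "timestamp": "temporal",
    }
    return groups.get(src_type) == groups.get(tgt_type)

def prefilter_pairs(source_schema, target_schema):
    """Keep only type-compatible (source, target) attribute pairs,
    so the LLM is never invoked on obviously invalid matches."""
    return [
        (s, t)
        for s, s_type in source_schema.items()
        for t, t_type in target_schema.items()
        if compatible(s_type, t_type)
    ]

# Illustrative schemas mapping attribute name -> declared type.
source = {"cust_id": "int", "cust_name": "varchar", "signup": "date"}
target = {"customer_key": "decimal", "full_name": "text",
          "created_at": "timestamp"}
pairs = prefilter_pairs(source, target)
# Only 3 of the 9 possible pairs survive; the other 6 never reach the LLM.
```

Since the number of candidate pairs grows with the product of the two schema sizes, pruning type-incompatible pairs before prompting is a plausible way to achieve the large reduction in LLM calls the paper reports.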