🤖 AI Summary
Multi-source data integration faces two sets of challenges: manual schema mapping scales poorly and is costly to maintain, while LLM-based approaches suffer from inconsistent outputs, limited expressiveness (e.g., inability to support GLaV), and excessive invocation overhead. To address these, this paper proposes a lightweight and robust LLM-augmented schema mapping framework. Methodologically: (i) sampling-based majority voting mitigates LLM output volatility; (ii) language-enhanced prompting natively supports highly expressive mappings such as GLaV; and (iii) structured, metadata-driven pre-filtering sharply reduces redundant LLM invocations. Experiments across multiple benchmarks demonstrate substantial improvements in mapping accuracy and robustness, with over 40% fewer LLM calls and efficient mapping generation for schemas containing up to hundreds of attributes. The framework establishes a new paradigm for scalable, low-maintenance data integration.
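The sampling-based majority voting idea can be illustrated with a minimal sketch. The paper does not publish its implementation, so the candidate format and the `majority_vote` helper below are assumptions: each sampled LLM call is treated as returning a set of (source, target) attribute correspondences, and the most frequent candidate wins.

```python
from collections import Counter

def majority_vote(candidates):
    """Aggregate sampled mapping candidates by frequency.

    Each candidate is a tuple of (source_attr, target_attr) pairs,
    so identical proposals can be counted directly and the most
    common one is returned as the consensus mapping.
    """
    counts = Counter(candidates)
    best, _ = counts.most_common(1)[0]
    return best

# Hypothetical outputs from k=5 sampled LLM calls for one attribute;
# the attribute names are illustrative, not from the paper.
samples = [
    (("cust_name", "customer.name"),),
    (("cust_name", "customer.name"),),
    (("cust_name", "customer.full_name"),),
    (("cust_name", "customer.name"),),
    (("cust_name", "customer.full_name"),),
]
consensus = majority_vote(samples)  # the mapping proposed 3 of 5 times
```

Sampling the same prompt several times and keeping only the modal answer is a generic way to damp stochastic variation in LLM output; the paper's actual aggregation may be more elaborate.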
📝 Abstract
The growing need to integrate information from a large number of diverse sources poses significant scalability challenges for data integration systems. These systems often rely on manually written schema mappings, which are complex, source-specific, and costly to maintain as sources evolve. While recent advances suggest that large language models (LLMs) can assist in automating schema matching by leveraging both structural and natural language cues, key challenges remain. In this paper, we identify three core issues with using LLMs for schema mapping: (1) inconsistent outputs due to sensitivity to input phrasing and structure, which we address through sampling and aggregation techniques; (2) the need for more expressive mappings (e.g., GLaV), which strain the limited context windows of LLMs; and (3) the computational cost of repeated LLM calls, which we mitigate through strategies such as data type prefiltering.
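The data type prefiltering strategy can be sketched as follows. This is a hedged illustration, not the paper's implementation: the `compatible` type groups and the example schemas are invented for clarity, and a real system would likely use richer structured metadata (lengths, constraints, value distributions).

```python
def compatible(src_type, tgt_type):
    # Hypothetical coarse type-compatibility groups.
    groups = {
        "int": "numeric", "float": "numeric", "decimal": "numeric",
        "varchar": "text", "text": "text", "char": "text",
        "date": "temporal", "timestamp": "temporal",
    }
    return groups.get(src_type) == groups.get(tgt_type)

def prefilter_pairs(source_schema, target_schema):
    """Keep only type-compatible (source, target) attribute pairs,
    so the LLM is never invoked on obviously invalid matches."""
    return [
        (s, t)
        for s, s_type in source_schema.items()
        for t, t_type in target_schema.items()
        if compatible(s_type, t_type)
    ]

# Illustrative schemas mapping attribute name -> declared type.
source = {"cust_id": "int", "cust_name": "varchar", "signup": "date"}
target = {"customer_key": "decimal", "full_name": "text",
          "created_at": "timestamp"}
pairs = prefilter_pairs(source, target)
# Only 3 of the 9 possible pairs survive; the other 6 never reach the LLM.
```

Since the number of candidate pairs grows with the product of the two schema sizes, pruning type-incompatible pairs before prompting is a plausible way to achieve the large reduction in LLM calls the paper reports.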