Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

📅 2025-08-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Semantic drift in enterprise data pipelines, caused by transformations spread across multiple programming languages, decouples original metadata from downstream data semantics, undermining reproducibility, governance, and the performance of RAG and text-to-SQL applications. To address this, the authors propose a fine-grained schema lineage extraction method combining multilingual script parsing, chain-of-thought prompting, and manually annotated evaluation data. They introduce SLiCE (Schema Lineage Composite Evaluation), a benchmark framework tailored for multilingual script lineage that jointly measures structural correctness and semantic fidelity, alongside a dataset of 1,700 annotated lineages from real-world industrial scripts. Experiments across 12 language models (1.3B to 32B small models plus GPT-4o and GPT-4.1) show that a 32B open-weight model using a single reasoning trace can match the GPT series under standard prompting, demonstrating cost-effective lineage extraction. Core contributions: (1) a standardized, fine-grained formalization of schema lineage; (2) an open, multilingual schema lineage benchmark with rigorous annotations; and (3) evidence that lightweight open models enable scalable, accurate lineage inference.

📝 Abstract
Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) such as GPT-4o and GPT-4.1. The results demonstrate that schema lineage extraction performance scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.
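The paper does not publish its exact representation, but the four components it names (source schemas, source tables, transformation logic, aggregation operations) suggest a record structure along the following lines. This is an illustrative sketch; the field names and the example lineage are assumptions, not the authors' schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical container for the four lineage components the paper describes.
@dataclass
class SchemaLineage:
    source_schema: str              # upstream column the output depends on
    source_table: str               # table that column originates from
    transformation: str             # expression producing the output column
    aggregation: Optional[str]      # aggregation operator, if any

# Lineage one might extract for a query such as:
#   SELECT SUM(amount) AS revenue FROM orders GROUP BY order_day
revenue_lineage = SchemaLineage(
    source_schema="amount",
    source_table="orders",
    transformation="SUM(amount) AS revenue",
    aggregation="SUM",
)

print(revenue_lineage)
```

A standardized record like this is what makes cross-language comparison possible: the same four fields can be populated whether the transformation was written in SQL, PySpark, or Scala.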
Problem

Research questions and friction points this paper is trying to address.

Extracting fine-grained schema lineage from multilingual enterprise pipelines
Addressing semantic drift in data reproducibility and governance
Evaluating lineage quality with structural and semantic metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated extraction of multilingual schema lineage
Composite evaluation metric for lineage quality
Benchmarking with 1,700 annotated industrial scripts
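The paper describes SLiCE as combining structural correctness with semantic fidelity, but does not spell out the formula here. The toy score below is only a sketch of that composite idea under stated assumptions: exact match on the structural fields, a crude token-overlap (Jaccard) proxy for semantic fidelity of the transformation logic, and an equal 50/50 weighting. None of these choices are claimed to be the actual SLiCE metric.

```python
def composite_score(pred: dict, gold: dict) -> float:
    """Toy composite score in the spirit of SLiCE (the real metric differs).

    Structural part: exact match on source table and source schema.
    Semantic part: Jaccard token overlap of the transformation expressions.
    """
    # Structural correctness: fraction of structural fields matched exactly.
    structural = sum(
        pred.get(k) == gold.get(k) for k in ("source_table", "source_schema")
    ) / 2

    # Semantic fidelity proxy: token overlap of the transformation logic.
    p = set(pred.get("transformation", "").lower().split())
    g = set(gold.get("transformation", "").lower().split())
    semantic = len(p & g) / len(p | g) if (p | g) else 1.0

    # Assumed equal weighting of the two parts.
    return 0.5 * structural + 0.5 * semantic
```

The point of a composite metric is visible even in this sketch: a prediction that names the wrong source table is penalized on the structural side even when its transformation expression is word-for-word correct.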
Jiaqi Yin
University of Maryland
EDA · Logic Synthesis · Formal Verification
Yi-Wei Chen
Microsoft
Meng-Lung Lee
Antra, Inc.
Xiya Liu
Microsoft