🤖 AI Summary
Semantic drift in enterprise data pipelines, caused by transformations spanning multiple programming languages, decouples metadata from downstream data semantics, undermining reproducibility, governance, and the performance of RAG and text-to-SQL applications. To address this, the paper proposes a method for automated extraction of fine-grained schema lineage from multilingual pipeline scripts, identifying four components: source schemas, source tables, transformation logic, and aggregation operations. It introduces SLiCE (Schema Lineage Composite Evaluation), a metric assessing both structural correctness and semantic fidelity, alongside a benchmark of 1,700 manually annotated lineages from real-world industrial scripts. Experiments with 12 models, spanning 1.3B–32B small language models as well as GPT-4o and GPT-4.1, show that a 32B open-weight model using a single reasoning trace matches the GPT series under standard prompting, pointing to a cost-effective route to lineage extraction. Core contributions: (1) a standardized, fine-grained formalization of schema lineage; (2) the SLiCE metric and a manually annotated real-world benchmark; and (3) evidence that schema-aware agents can be deployed economically with open-weight models.
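For intuition, the four-component lineage representation named above could be modeled as a simple record. This is a minimal sketch only; the field names and example script are illustrative assumptions, not the paper's actual schema or data.

```python
from dataclasses import dataclass, field

@dataclass
class SchemaLineage:
    """Illustrative record for one downstream column's fine-grained lineage.

    Mirrors the four components the paper extracts: source schemas,
    source tables, transformation logic, and aggregation operations.
    All names here are assumptions made for this sketch.
    """
    target_column: str
    source_schemas: list[str] = field(default_factory=list)  # upstream columns
    source_tables: list[str] = field(default_factory=list)   # tables they come from
    transformation: str = ""                                 # logic producing the target
    aggregation: str = ""                                    # aggregation operation, if any

# Hypothetical lineage extracted from:
#   SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region
lineage = SchemaLineage(
    target_column="total_sales",
    source_schemas=["sales.amount"],
    source_tables=["sales"],
    transformation="SUM(amount)",
    aggregation="GROUP BY region",
)
print(lineage.target_column)  # total_sales
```

A standardized record like this is what makes lineages comparable across scripts written in different languages, since each extractor emits the same four fields regardless of the source syntax.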
📝 Abstract
Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services such as retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. The method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, yielding a standardized representation of data transformations. For rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from 1.3B–32B small language models (SLMs) to large language models (LLMs) such as GPT-4o and GPT-4.1. The results demonstrate that schema lineage extraction performance scales with both model size and the sophistication of the prompting technique. Notably, a 32B open-source model using a single reasoning trace achieves performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach to deploying schema-aware agents in practical applications.
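To make the idea of a composite metric concrete, here is a minimal sketch of how structural correctness and semantic fidelity might be blended into one score. The component scorers and the equal weighting below are hypothetical; the paper's actual SLiCE formulation is not reproduced here.

```python
def jaccard(pred: set[str], gold: set[str]) -> float:
    """Set overlap for structural components (tables, columns)."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

def composite_lineage_score(pred: dict, gold: dict, alpha: float = 0.5) -> float:
    """Hypothetical composite: structural match blended with a semantic score.

    A real semantic score would come from an embedding similarity or an
    LLM judge; exact string match on the transformation logic stands in
    for it in this sketch.
    """
    structural = 0.5 * (
        jaccard(set(pred["source_tables"]), set(gold["source_tables"]))
        + jaccard(set(pred["source_schemas"]), set(gold["source_schemas"]))
    )
    semantic = 1.0 if pred["transformation"] == gold["transformation"] else 0.0
    return alpha * structural + (1 - alpha) * semantic

# A prediction that fully matches the gold annotation scores 1.0
pred = {"source_tables": {"sales"}, "source_schemas": {"sales.amount"},
        "transformation": "SUM(amount)"}
gold = {"source_tables": {"sales"}, "source_schemas": {"sales.amount"},
        "transformation": "SUM(amount)"}
print(composite_lineage_score(pred, gold))  # 1.0
```

The point of such a blend is that structural overlap alone rewards getting the right tables and columns even when the transformation logic is wrong, while a semantic term penalizes exactly that failure mode.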