🤖 AI Summary
Semantic drift in enterprise data pipelines, caused by transformations spanning multiple programming languages, decouples metadata from downstream data semantics, undermining reproducibility, governance, and the performance of RAG and text-to-SQL applications. To address this, the paper proposes a method for automated extraction of fine-grained schema lineage from multilingual pipeline scripts, identifying four components: source schemas, source tables, transformation logic, and aggregation operations. It introduces SLiCE (Schema Lineage Composite Evaluation), a metric assessing both structural correctness and semantic fidelity, alongside a benchmark of 1,700 manually annotated lineages from real-world industrial scripts. Experiments with 12 models, spanning 1.3B–32B small language models as well as GPT-4o and GPT-4.1, show that a 32B open-weight model using a single reasoning trace matches the GPT series under standard prompting, pointing to a cost-effective route to lineage extraction. Core contributions: (1) a standardized, fine-grained formalization of schema lineage; (2) the SLiCE metric and a manually annotated real-world benchmark; and (3) evidence that schema-aware agents can be deployed economically with open-weight models.
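For intuition, the four-component lineage representation named above could be modeled as a simple record. This is a minimal sketch only; the field names and example script are illustrative assumptions, not the paper's actual schema or data.

```python
from dataclasses import dataclass, field

@dataclass
class SchemaLineage:
    """Illustrative record for one downstream column's fine-grained lineage.

    Mirrors the four components the paper extracts: source schemas,
    source tables, transformation logic, and aggregation operations.
    All names here are assumptions made for this sketch.
    """
    target_column: str
    source_schemas: list[str] = field(default_factory=list)  # upstream columns
    source_tables: list[str] = field(default_factory=list)   # tables they come from
    transformation: str = ""                                 # logic producing the target
    aggregation: str = ""                                    # aggregation operation, if any

# Hypothetical lineage extracted from:
#   SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region
lineage = SchemaLineage(
    target_column="total_sales",
    source_schemas=["sales.amount"],
    source_tables=["sales"],
    transformation="SUM(amount)",
    aggregation="GROUP BY region",
)
print(lineage.target_column)  # total_sales
```

A standardized record like this is what makes lineages comparable across scripts written in different languages, since each extractor emits the same four fields regardless of the source syntax.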
📝 Abstract
Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services such as retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. The method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, yielding a standardized representation of data transformations. For rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from 1.3B–32B small language models (SLMs) to large language models (LLMs) such as GPT-4o and GPT-4.1. The results demonstrate that schema lineage extraction performance scales with both model size and the sophistication of the prompting technique. Notably, a 32B open-source model using a single reasoning trace achieves performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach to deploying schema-aware agents in practical applications.
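To make the idea of a composite metric concrete, here is a minimal sketch of how structural correctness and semantic fidelity might be blended into one score. The component scorers and the equal weighting below are hypothetical; the paper's actual SLiCE formulation is not reproduced here.

```python
def jaccard(pred: set[str], gold: set[str]) -> float:
    """Set overlap for structural components (tables, columns)."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

def composite_lineage_score(pred: dict, gold: dict, alpha: float = 0.5) -> float:
    """Hypothetical composite: structural match blended with a semantic score.

    A real semantic score would come from an embedding similarity or an
    LLM judge; exact string match on the transformation logic stands in
    for it in this sketch.
    """
    structural = 0.5 * (
        jaccard(set(pred["source_tables"]), set(gold["source_tables"]))
        + jaccard(set(pred["source_schemas"]), set(gold["source_schemas"]))
    )
    semantic = 1.0 if pred["transformation"] == gold["transformation"] else 0.0
    return alpha * structural + (1 - alpha) * semantic

# A prediction that fully matches the gold annotation scores 1.0
pred = {"source_tables": {"sales"}, "source_schemas": {"sales.amount"},
        "transformation": "SUM(amount)"}
gold = {"source_tables": {"sales"}, "source_schemas": {"sales.amount"},
        "transformation": "SUM(amount)"}
print(composite_lineage_score(pred, gold))  # 1.0
```

The point of such a blend is that structural overlap alone rewards getting the right tables and columns even when the transformation logic is wrong, while a semantic term penalizes exactly that failure mode.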