🤖 AI Summary
This paper addresses the automated extraction of queryable tables from semi-structured documents. We propose Evaporate, a general-purpose, zero-shot, domain-agnostic system. Its core method, Evaporate-Code+, leverages large language models (LLMs) to generate multiple candidate extraction functions via in-context learning and code synthesis; it then selects the optimal strategy through weakly supervised ensemble filtering and executes extraction with only sublinear document traversal. Evaluated across 16 real-world scenarios, Evaporate reduces LLM invocation count by 110× over state-of-the-art systems, achieves significantly higher accuracy than direct prompting, and maintains both high precision and low computational cost. Our key contribution is the first end-to-end table-oriented distillation framework that requires no manual annotation, no domain-specific customization, and scales seamlessly to heterogeneous data lakes.
📝 Abstract
A long-standing goal in the data management community is developing systems that input documents and output queryable tables without user effort. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using the in-context learning abilities of large language models (LLMs). We propose and evaluate Evaporate, a prototype system powered by LLMs. We identify two strategies for implementing this system: prompt the LLM to directly extract values from documents, or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended implementation, Evaporate-Code+, which achieves better quality than direct extraction. Our insight is to generate many candidate functions and ensemble their extractions using weak supervision. Evaporate-Code+ outperforms the state-of-the-art systems using a sublinear pass over the documents with the LLM. This equates to a 110× reduction in the number of documents the LLM needs to process across our 16 real-world evaluation settings.