🤖 AI Summary
To address high data extraction error rates (up to 30%) on unstructured data and the lack of schema awareness and cross-document alignment in existing methods (e.g., RAG), this paper proposes ReDD (Relational Deep Dive), a two-stage framework. First, Iterative Schema Discovery (ISD) identifies a minimal, joinable schema tailored to each query. Second, Tabular Data Population (TDP) extracts and corrects table entries using lightweight classifiers trained on LLM hidden states. A main contribution is SCAPE, a statistically calibrated error detection method with coverage guarantees, together with SCAPE-HYB, a hybrid strategy that optimizes the trade-off between accuracy and human correction cost. Experiments show that ReDD reduces extraction errors to below 1% while maintaining high schema completeness (100% recall) and precision, enabling fine-grained control over accuracy-cost trade-offs for analytical queries over unstructured data.
📝 Abstract
Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy and human correction costs. Experiments across diverse datasets demonstrate ReDD's effectiveness, reducing data extraction errors from up to 30% to below 1% while maintaining high schema completeness (100% recall) and precision. ReDD's modular design enables fine-grained control over accuracy-cost trade-offs, making it a robust solution for high-stakes analytical queries over unstructured corpora.
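The abstract does not spell out how SCAPE calibrates its error detector. As a rough illustration of what "statistically calibrated error detection with coverage guarantees" can look like, the sketch below uses standard split conformal calibration as a stand-in; it is not the paper's algorithm. It assumes each extracted cell has an error score (higher means more likely wrong), for example from a classifier, and that a held-out set of known-erroneous extractions is available. The names `calibrate_threshold` and `flag_for_review` are hypothetical.

```python
import math


def calibrate_threshold(error_scores, alpha):
    """Pick a flagging threshold from scores of known-erroneous extractions.

    Assumption (not from the paper): calibration errors and future errors
    are exchangeable. Then a future erroneous extraction scores below the
    returned threshold with probability at most alpha, so flagging
    everything at or above it catches errors with coverage >= 1 - alpha.
    """
    n = len(error_scores)
    k = math.floor(alpha * (n + 1))  # rank of the calibration score to use
    if k < 1:
        # Too few calibration points for this alpha: flag everything.
        return float("-inf")
    return sorted(error_scores)[k - 1]  # k-th smallest calibration score


def flag_for_review(scores, threshold):
    """Return True for each extraction whose error score meets the threshold."""
    return [s >= threshold for s in scores]


# Hypothetical calibration scores for extractions known to be wrong.
calibration = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.75]
threshold = calibrate_threshold(calibration, alpha=0.25)
flags = flag_for_review([0.5, 0.7, 0.92], threshold)
```

Lowering `alpha` pushes the threshold down, so more cells are flagged and sent for human correction; this is the kind of accuracy-cost dial that a hybrid strategy such as SCAPE-HYB is described as tuning.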