Relational Deep Dive: Error-Aware Queries Over Unstructured Data

📅 2025-11-04

🤖 AI Summary

Analytical queries over unstructured data suffer from high extraction error rates (up to 30%), and existing methods such as RAG lack schema awareness and cross-document alignment. This paper proposes ReDD (Relational Deep Dive), a two-stage framework. First, Iterative Schema Discovery (ISD) constructs a minimal, joinable schema tailored to each query. Second, Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. ReDD also introduces SCAPE, a statistically calibrated error-detection method with coverage guarantees, and SCAPE-HYB, a hybrid strategy that balances accuracy against human correction cost. Experiments show that ReDD reduces extraction errors to below 1% while maintaining 100% schema recall and high precision, giving fine-grained control over the accuracy-cost trade-off.

📝 Abstract

Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy and human correction costs. Experiments across diverse datasets demonstrate ReDD's effectiveness, reducing data extraction errors from up to 30% to below 1% while maintaining high schema completeness (100% recall) and precision. ReDD's modular design enables fine-grained control over accuracy-cost trade-offs, making it a robust solution for high-stakes analytical queries over unstructured corpora.
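The abstract describes SCAPE only as a statistically calibrated error detector with coverage guarantees and does not spell out the procedure. As a rough illustration of what such a guarantee can look like, the sketch below uses a generic split-conformal-style quantile rule, which is an assumption here and not necessarily the authors' method. The names `calibrate_threshold` and `flag_for_review`, and all scores, are hypothetical. Given error scores for calibration cells known to be wrong, it picks a threshold so that a new erroneous cell (assumed exchangeable with the calibration errors) is flagged with probability at least 1 - alpha.

```python
import math


def calibrate_threshold(error_scores, alpha):
    """Pick a flagging threshold from scores of known-wrong calibration cells.

    Hypothetical helper, not the paper's algorithm. Higher score means
    "more likely wrong". Flagging every cell with score >= threshold misses
    a new true error with probability at most alpha, assuming the new error
    is exchangeable with the calibration errors.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    n = len(error_scores)
    # Rank of the calibration score used as the cut-off.
    k = math.floor(alpha * (n + 1))
    if k == 0:
        # Too few calibration errors for this alpha: flag everything.
        return float("-inf")
    return sorted(error_scores)[k - 1]


def flag_for_review(scores, threshold):
    """Return True for each extracted cell whose score reaches the threshold."""
    return [s >= threshold for s in scores]


# Made-up scores of nine calibration cells that were extracted incorrectly.
calibration_error_scores = [0.91, 0.35, 0.78, 0.66, 0.88, 0.52, 0.97, 0.73, 0.81]
threshold = calibrate_threshold(calibration_error_scores, alpha=0.25)
flags = flag_for_review([0.10, 0.52, 0.90], threshold)
```

With nine calibration errors and alpha = 0.25, the rule selects the second-smallest score as the cut-off. A smaller alpha pushes the threshold down, so more cells are flagged and the human review load grows.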

Problem

Research questions and friction points this paper is trying to address.

How to dynamically discover query-specific schemas for analysis over unstructured data

How to reduce extraction errors from up to 30% to below 1% with provable guarantees

How to optimize the accuracy-cost trade-off using calibrated error detection
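To make the accuracy-cost trade-off concrete, here is a minimal sketch of a threshold sweep over a labeled validation set. It is not SCAPE-HYB itself, whose details are not given in this summary. The functions `evaluate_policy` and `cheapest_threshold`, the data, and the assumption that a human reviewer always fixes a flagged cell are all hypothetical.

```python
def evaluate_policy(items, threshold):
    """Return (review_fraction, residual_error_rate) for one threshold.

    items: list of (error_score, is_error) pairs from a labeled validation set.
    Cells with score >= threshold go to a human, who is assumed (for this
    sketch only) to correct every error they see.
    """
    n = len(items)
    reviewed = sum(1 for score, _ in items if score >= threshold)
    residual = sum(1 for score, is_error in items if is_error and score < threshold)
    return reviewed / n, residual / n


def cheapest_threshold(items, target_error_rate):
    """Find the threshold with the least human review that meets the target."""
    # Try thresholds from "review nothing" down to "review everything".
    candidates = [float("inf")] + sorted({score for score, _ in items}, reverse=True)
    for t in candidates:
        cost, err = evaluate_policy(items, t)
        if err <= target_error_rate:
            return t, cost, err
    raise ValueError("target_error_rate must be non-negative")


# Made-up validation cells: (error score, whether the extraction was wrong).
validation = [
    (0.90, True), (0.80, False), (0.70, True), (0.40, False),
    (0.30, True), (0.20, False), (0.10, False), (0.05, False),
]
loose = cheapest_threshold(validation, target_error_rate=0.125)
strict = cheapest_threshold(validation, target_error_rate=0.0)
```

In this toy data, tolerating a 12.5% residual error rate requires reviewing 3 of 8 cells, while driving the residual error to zero requires reviewing 5 of 8. That is the kind of trade-off a hybrid strategy has to navigate.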

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically discovers query-specific schemas for extraction

Populates relational tables using lightweight classifiers trained on LLM hidden states

Ensures error-aware extraction with statistically calibrated guarantees
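As an illustration of why a joinable, query-specific schema matters, the sketch below builds two small relational tables and answers an aggregate query with a join. The schema, table names, and rows are invented for this example and do not come from the paper. It uses Python's standard `sqlite3` module and assumes the extraction stage has already produced clean rows.

```python
import sqlite3


def average_points_per_team():
    """Populate a toy two-table schema and run an analytical join query."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema for the query "average points per team":
    # only the attributes the query needs, linked by a join key.
    conn.executescript(
        """
        CREATE TABLE team (
            team_id INTEGER PRIMARY KEY,
            name    TEXT
        );
        CREATE TABLE player (
            player_id INTEGER PRIMARY KEY,
            name      TEXT,
            team_id   INTEGER REFERENCES team(team_id),
            points    REAL
        );
        """
    )
    # Rows that an extraction stage might have pulled from documents (made up).
    conn.executemany("INSERT INTO team VALUES (?, ?)", [(1, "Hawks"), (2, "Owls")])
    conn.executemany(
        "INSERT INTO player VALUES (?, ?, ?, ?)",
        [(1, "A", 1, 20.0), (2, "B", 1, 10.0), (3, "C", 2, 12.0)],
    )
    rows = conn.execute(
        """
        SELECT t.name, AVG(p.points)
        FROM player AS p
        JOIN team AS t ON p.team_id = t.team_id
        GROUP BY t.name
        ORDER BY t.name
        """
    ).fetchall()
    conn.close()
    return rows


result = average_points_per_team()
```

Once the data sits in tables like these, the analytical query is ordinary SQL. A single wrong `team_id` or `points` value would silently shift the aggregate, which is why error detection on the populated cells matters.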
