A General Information Extraction Framework Based on Formal Languages

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a unified formal framework for general-purpose information extraction. It proposes a formal-language-based extractor model that maps each input string to a table with one column per attribute, where every cell holds a set of positions of the input string. The model is intended to cover named entity recognition, relation extraction, and question-answering-style extraction within one formalism. Extractors are defined over the joint alphabet of terminal symbols and attribute symbols, which extends the classical document spanner framework. The paper establishes closure properties, characterizes the computational complexity of decision problems such as equivalence and containment checking (e.g., PSPACE-completeness), and gives representations via extended regular expressions and relational algebra. Together, these results provide a formal-language foundation for information extraction.

📝 Abstract
For a terminal alphabet $\Sigma$ and an attribute alphabet $\Gamma$, a $(\Sigma, \Gamma)$-extractor is a function that maps every string $w$ over $\Sigma$ to a table with a column per attribute and with sets of positions of $w$ as cell entries. This rather general information extraction framework extends the well-known document spanner framework, which has been intensively investigated in the database theory community over the last decade. Moreover, our framework is based on formal language theory in a particularly clean and simple way. In addition to this conceptual contribution, we investigate closure properties, different representation formalisms and the complexity of natural decision problems for extractors.
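
To make the definition in the abstract concrete, the following is a minimal Python sketch of a toy extractor: a function from strings to tables with one column per attribute, whose cells are sets of positions of the input string. It is an illustration only, not the paper's construction. The names `regex_extractor` and `extract_pairs`, the use of Python named groups as attributes, the 1-based position numbering, and the choice of one table row per regex match are all assumptions made for this example. The sketch also only produces contiguous position sets (spans), whereas the abstract allows arbitrary sets of positions.

```python
import re
from typing import Callable, Dict, FrozenSet, List

# One table row: attribute name -> set of positions of the input string.
Row = Dict[str, FrozenSet[int]]


def regex_extractor(pattern: str, attributes: List[str]) -> Callable[[str], List[Row]]:
    """Build a toy extractor from a regex with one named group per attribute.

    Hypothetical helper for illustration; the paper does not define it.
    """
    compiled = re.compile(pattern)

    def extract(w: str) -> List[Row]:
        table: List[Row] = []
        # Assumption: every non-overlapping match contributes one row.
        for match in compiled.finditer(w):
            row: Row = {}
            for attribute in attributes:
                start, end = match.span(attribute)
                if start < 0:
                    # The group did not take part in the match: empty cell.
                    row[attribute] = frozenset()
                else:
                    # Convert the 0-based half-open span to 1-based positions.
                    row[attribute] = frozenset(range(start + 1, end + 1))
            table.append(row)
        return table

    return extract


# Terminal symbols: lowercase letters, digits, '=' and ';'.
# Attribute symbols: "key" and "val".
extract_pairs = regex_extractor(r"(?P<key>[a-z]+)=(?P<val>[0-9]+)", ["key", "val"])

if __name__ == "__main__":
    for row in extract_pairs("ab=12;c=3"):
        print({name: sorted(cell) for name, cell in row.items()})
```

On the input `ab=12;c=3`, this toy extractor yields two rows: the first assigns positions 1 and 2 to `key` and positions 4 and 5 to `val`, and the second assigns position 7 to `key` and position 9 to `val`.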
Problem

Research questions and friction points this paper is trying to address.

Extends the document spanner framework to a more general information extraction setting
Investigates closure properties of extractors
Analyzes the complexity of natural decision problems for extractors
Innovation

Methods, ideas, or system contributions that make the work stand out.

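A general extractor framework based on formal language theory
Extractors map strings to tables whose cells are sets of positions, extending document spanners
Closure properties, representation formalisms, and complexity results for extractors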