A General Information Extraction Framework Based on Formal Languages

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a unified formal framework for general-purpose information extraction. We propose a formal-language-based extractor model that maps input strings to attribute-labeled tables; each cell stores a set of positions of the input string for its attribute, so that a single model covers named entity recognition, relation extraction, and question-answering-style extraction. Extractors are defined via a joint alphabet of terminal symbols and attribute symbols, extending classical document spanner theory. We establish closure properties, characterize the computational complexity of equivalence and containment checking (e.g., PSPACE-completeness), and provide both extended regular expression representations and relational algebra characterizations.

📝 Abstract

For a terminal alphabet $\Sigma$ and an attribute alphabet $\Gamma$, a $(\Sigma, \Gamma)$-extractor is a function that maps every string $w$ over $\Sigma$ to a table with a column per attribute and with sets of positions of $w$ as cell entries. This rather general information extraction framework extends the well-known document spanner framework, which has been intensively investigated in the database theory community over the last decade. Moreover, our framework is based on formal language theory in a particularly clean and simple way. In addition to this conceptual contribution, we investigate closure properties, different representation formalisms, and the complexity of natural decision problems for extractors.
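To make the definition concrete, here is a minimal Python sketch of one possible extractor. It is only an illustration under stated assumptions, not the paper's formalism: the name `toy_extractor`, the attributes `key` and `value`, the `key=value` input pattern, the use of 1-based positions, and the representation of a table as a list of rows are all hypothetical choices made for this example.

```python
import re
from typing import Dict, FrozenSet, List

# Assumption: a row maps each attribute to a set of 1-based positions of w,
# and a table is a list of such rows (one column per attribute).
Row = Dict[str, FrozenSet[int]]
Table = List[Row]


def toy_extractor(w: str) -> Table:
    """Hypothetical extractor with attributes 'key' and 'value'.

    Emits one row per 'key=value' occurrence in w, where a key is a run of
    lowercase letters and a value is a run of digits.
    """
    table: Table = []
    for m in re.finditer(r"([a-z]+)=([0-9]+)", w):
        table.append({
            # m.start/m.end are 0-based and half-open; shift to 1-based positions.
            "key": frozenset(range(m.start(1) + 1, m.end(1) + 1)),
            "value": frozenset(range(m.start(2) + 1, m.end(2) + 1)),
        })
    return table


if __name__ == "__main__":
    for row in toy_extractor("a=1;bc=23"):
        print({attr: sorted(cell) for attr, cell in row.items()})
```

In this sketch every cell happens to be a contiguous block of positions, which is the situation a span describes. Because cells are plain sets of positions, a cell could just as well hold non-contiguous positions.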

Problem

Research questions and friction points this paper is trying to address.

Extends the document spanner framework for information extraction

Investigates closure properties of extractors

Analyzes complexity of decision problems for extractors

Innovation

Methods, ideas, or system contributions that make the work stand out.

General extractor framework using formal languages

Extends document spanner framework

Investigates closure properties and complexity
