🤖 AI Summary
This work addresses the absence of a unified formal framework for general-purpose information extraction. We propose a formal-language-based universal extractor model, which maps input strings to attribute-labeled tables. Each cell stores a set of positions of the input string for its attribute, which unifies named entity recognition, relation extraction, and question-answering-style extraction. Extractors are defined over the joint alphabet of terminal symbols and attribute symbols, extending classical document spanner theory. We establish their closure properties, characterize the computational complexity of equivalence and containment checking (e.g., PSPACE-completeness), and provide both extended regular expression representations and relational algebra characterizations. Together, these results give information extraction a formal semantic foundation grounded in formal language theory.
📝 Abstract
For a terminal alphabet $\Sigma$ and an attribute alphabet $\Gamma$, a $(\Sigma, \Gamma)$-extractor is a function that maps every string $w$ over $\Sigma$ to a table with a column per attribute and with sets of positions of $w$ as cell entries. This rather general information extraction framework extends the well-known document spanner framework, which has been intensively investigated in the database theory community over the last decade. Moreover, our framework is based on formal language theory in a particularly clean and simple way. In addition to this conceptual contribution, we investigate closure properties, different representation formalisms and the complexity of natural decision problems for extractors.
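To make the definition concrete, here is a minimal Python sketch of one toy extractor. It is an illustration under stated assumptions, not a construction from the paper: the alphabet (lowercase letters and digits), the attribute names `letters` and `digits`, the function name `toy_extractor`, the 1-based position numbering, and the choice to model a table as a list of rows are all hypothetical.

```python
from typing import Dict, FrozenSet, List

# Assumed attribute alphabet Gamma for this toy example.
GAMMA = ("letters", "digits")

# A row assigns a set of positions of w to every attribute in Gamma.
Row = Dict[str, FrozenSet[int]]
# Assumption: a table is modelled as a list of rows.
Table = List[Row]


def toy_extractor(w: str) -> Table:
    """Map a string w over Sigma (assumed: lowercase letters and digits)
    to a one-row table whose cells are sets of 1-based positions of w."""
    for ch in w:
        if not (ch.isdigit() or ("a" <= ch <= "z")):
            raise ValueError(f"symbol {ch!r} is not in the assumed alphabet")

    # Collect the positions of letters and of digits separately.
    letters = frozenset(i for i, ch in enumerate(w, start=1) if not ch.isdigit())
    digits = frozenset(i for i, ch in enumerate(w, start=1) if ch.isdigit())
    return [{"letters": letters, "digits": digits}]


if __name__ == "__main__":
    for row in toy_extractor("a1b22"):
        print({attr: sorted(cell) for attr, cell in row.items()})
```

For the input `a1b22`, the `letters` cell holds positions 1 and 3, which do not form a contiguous interval. This is one way to see how cells holding arbitrary sets of positions can be more general than cells holding a single span.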