Text Data Integration

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of effectively integrating structured data with unstructured text, a longstanding barrier in data management. It presents the first systematic argument for the necessity of textual data integration and introduces a unified framework that synergistically combines natural language processing, knowledge extraction, and traditional data integration techniques. By leveraging semantic alignment, the framework achieves deep integration between textual content and structured schemas, thereby tackling key challenges inherent in heterogeneous data integration. The work comprehensively surveys existing methodologies and outstanding issues, establishing a theoretical foundation for the emerging field of textual data integration and offering clear guidance for future research and practical implementation.
📝 Abstract
Data comes in many forms. At a surface level, it can be viewed as either structured (e.g., a relation or key-value pairs) or unstructured (e.g., text or images). So far, machines have been fairly good at processing and reasoning over structured data that follows a precise schema. However, the heterogeneity of data poses a significant challenge to how well diverse categories of data can be meaningfully stored and processed. Data Integration, a crucial part of the data engineering pipeline, addresses this by combining disparate data sources and providing unified data access to end users. Until now, most data integration systems have focused on combining only structured data sources. Nevertheless, unstructured data (a.k.a. free text) also contains a wealth of knowledge waiting to be utilized. Thus, in this chapter, we first make the case for the integration of textual data and then present its challenges, state of the art, and open problems.
Problem

Research questions and friction points this paper is trying to address.

Data Integration
Unstructured Data
Text Data
Heterogeneous Data
Structured Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text Data Integration
Unstructured Data
Data Integration
Knowledge Extraction
Heterogeneous Data
Md Ataur Rahman
Universitat Politècnica de Catalunya, Barcelona, Spain; Université libre de Bruxelles, Brussels, Belgium
Dimitris Sacharidis
Université Libre de Bruxelles (ULB)
Responsible AI
Oscar Romero
Universitat Politècnica de Catalunya, BarcelonaTech
Data Management, Data Governance, Data Integration, Big Data, Data Science
Sergi Nadal
Universitat Politècnica de Catalunya, Barcelona, Spain