Schema Inference for Tabular Data Repositories Using Large Language Models

📅 2025-09-04
🤖 AI Summary
Inferring a hierarchical conceptual schema (entity types, attributes, and semantic relationships) from large, heterogeneous collections of tables with sparse metadata remains challenging, especially when only column names and cell values are available. This paper introduces SI-LLM, the first end-to-end framework driven by large language models (LLMs) for schema inference without external knowledge bases. Using prompt engineering and stepwise reasoning, SI-LLM performs column semantic role identification, entity type clustering, and relation extraction. Evaluated on two real-world datasets (web tables and open data), it achieves promising end-to-end results and performs better than or comparably to state-of-the-art methods at each step, making heterogeneous tables easier to interpret and structure. All code, prompt templates, and datasets are publicly released.

📝 Abstract
Minimally curated tabular data often contain representational inconsistencies across heterogeneous sources, and are accompanied by sparse metadata. Working with such data is intimidating. While prior work has advanced dataset discovery and exploration, schema inference remains difficult when metadata are limited. We present SI-LLM (Schema Inference using Large Language Models), which infers a concise conceptual schema for tabular data using only column headers and cell values. The inferred schema comprises hierarchical entity types, attributes, and inter-type relationships. In extensive evaluation on two datasets from web tables and open data, SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step. All source code, full prompts, and datasets of SI-LLM are available at https://github.com/PierreWoL/SILLM.
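
To make the shape of the inferred schema concrete, the sketch below models the three components named in the abstract (hierarchical entity types, attributes, and inter-type relationships) as plain Python data classes. It is only an illustration under stated assumptions: the class names, the `ancestors` and `inherited_attributes` helpers, and the example types (`Organisation`, `Company`, `Person`) are hypothetical and are not taken from the SI-LLM code base.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntityType:
    """One node in the type hierarchy (hypothetical representation)."""
    name: str
    parent: Optional[str] = None          # name of the more general type, if any
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    """A labelled, directed link between two entity types."""
    source: str
    label: str
    target: str


@dataclass
class ConceptualSchema:
    types: Dict[str, EntityType] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_type(self, entity: EntityType) -> None:
        self.types[entity.name] = entity

    def ancestors(self, name: str) -> List[str]:
        """Return known ancestor type names, nearest first."""
        chain: List[str] = []
        current = self.types[name].parent
        # Stop at the root, at an unknown parent, or if a cycle is detected.
        while current is not None and current in self.types and current not in chain:
            chain.append(current)
            current = self.types[current].parent
        return chain

    def inherited_attributes(self, name: str) -> List[str]:
        """Attributes of a type, including those of its ancestors (root first)."""
        attrs: List[str] = []
        for type_name in reversed(self.ancestors(name)):
            attrs.extend(self.types[type_name].attributes)
        attrs.extend(self.types[name].attributes)
        return attrs


# Illustrative (made-up) schema: Company is a subtype of Organisation.
schema = ConceptualSchema()
schema.add_type(EntityType("Organisation", attributes=["name", "country"]))
schema.add_type(EntityType("Company", parent="Organisation", attributes=["founded"]))
schema.add_type(EntityType("Person", attributes=["name"]))
schema.relationships.append(Relationship("Company", "foundedBy", "Person"))
```

Storing parents by name rather than by object reference keeps the structure easy to serialize; that is a design choice of this sketch, not something the abstract specifies.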

Problem

Research questions and friction points this paper is trying to address.

Inferring concise conceptual schemas from tabular data
Addressing representational inconsistencies across heterogeneous sources
Overcoming sparse metadata limitations using column headers and cell values

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs for schema inference from column headers and cell values only
Infers hierarchical entities and relationships
Achieves promising end-to-end results, and better or comparable results to state-of-the-art methods at each step, with minimal metadata
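
As a rough illustration of what "using only column headers and cell values" can look like in practice, the sketch below serializes a table's headers and a few sample rows into a text prompt asking for one column's semantic role. It makes no LLM call, and it is not the authors' prompt: the function names, the wording of the instructions, and the example table are assumptions made for this example. The actual prompts are in the repository linked in the abstract.

```python
from typing import List, Optional, Sequence


def serialize_table(headers: Sequence[str],
                    rows: Sequence[Sequence[Optional[object]]],
                    max_rows: int = 3) -> str:
    """Render headers plus the first `max_rows` rows as pipe-separated text."""
    lines: List[str] = [" | ".join(headers)]
    for row in rows[:max_rows]:
        # Missing cells are rendered as empty strings.
        lines.append(" | ".join("" if value is None else str(value) for value in row))
    return "\n".join(lines)


def build_column_role_prompt(headers: Sequence[str],
                             rows: Sequence[Sequence[Optional[object]]],
                             column: str,
                             max_rows: int = 3) -> str:
    """Build a (hypothetical) prompt asking for the semantic role of one column."""
    if column not in headers:
        raise ValueError(f"unknown column: {column!r}")
    table_text = serialize_table(headers, rows, max_rows)
    return (
        "The table below has no metadata other than its column headers "
        "and sample cell values.\n\n"
        f"{table_text}\n\n"
        f"Question: what is the semantic role of the column '{column}'? "
        "Answer with 'subject', 'attribute', or 'reference to another entity'."
    )


# Made-up example table; only the first two rows are included in the prompt.
headers = ["name", "founded", "founder"]
rows = [
    ["Alpha Ltd", 1999, "A. Smith"],
    ["Beta GmbH", 2004, None],
    ["Gamma Inc", 2010, "C. Jones"],
]
prompt = build_column_role_prompt(headers, rows, column="founder", max_rows=2)
```

Capping the number of sampled rows keeps the prompt short for wide or long tables; how many rows to sample, and how to choose them, is left open here.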