🤖 AI Summary
Inferring hierarchical conceptual schemas (entity types, attributes, and semantic relationships) from large-scale tabular data remains challenging when metadata are sparse and sources are heterogeneous, particularly when only column names and cell values are available. This paper introduces SI-LLM, the first end-to-end framework driven by large language models (LLMs) for schema inference without external knowledge bases. Using prompt engineering and stepwise reasoning, SI-LLM jointly performs column semantic role identification, entity type clustering, and relation extraction. Evaluated on two real-world datasets, it matches or exceeds state-of-the-art end-to-end performance, improving both semantic interpretability and the efficiency of imposing structure on heterogeneous tables. All code, prompt templates, and datasets are publicly released.
📝 Abstract
Minimally curated tabular data often contain representational inconsistencies across heterogeneous sources and are accompanied by sparse metadata; working with such data can be daunting. While prior work has advanced dataset discovery and exploration, schema inference remains difficult when metadata are limited. We present SI-LLM (Schema Inference using Large Language Models), which infers a concise conceptual schema for tabular data using only column headers and cell values. The inferred schema comprises hierarchical entity types, attributes, and inter-type relationships. In an extensive evaluation on two datasets drawn from web tables and open data, SI-LLM achieves promising end-to-end results, as well as results at each step that are better than or comparable to those of state-of-the-art methods. All source code, full prompts, and datasets are available at https://github.com/PierreWoL/SILLM.
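To make the setup concrete, the sketch below illustrates the kind of header-and-value prompting the abstract describes: building a prompt from a table's column headers and a few sample rows, then asking a model for an entity type. The function names, prompt wording, and the stubbed `call_llm` are hypothetical illustrations, not taken from SI-LLM's actual prompt templates (those are in the linked repository).

```python
# Hypothetical sketch: infer an entity type for one table using only its
# column headers and sample cell values, as the abstract describes.
def build_type_prompt(headers, sample_rows):
    # Serialize the table's headers and a few rows into a plain-text prompt.
    lines = ["Columns: " + ", ".join(headers)]
    for row in sample_rows:
        lines.append("Row: " + ", ".join(str(v) for v in row))
    lines.append("What single entity type does this table describe? "
                 "Answer with one noun phrase.")
    return "\n".join(lines)

def call_llm(prompt):
    # Stub standing in for a real LLM API call; a real implementation
    # would send `prompt` to a model and parse its reply.
    return "Film"

headers = ["title", "director", "release_year"]
rows = [("Alien", "Ridley Scott", 1979)]
prompt = build_type_prompt(headers, rows)
entity_type = call_llm(prompt)
```

In the full pipeline, the resulting per-table types would then be clustered into a hierarchy and pairs of types would be prompted again for inter-type relationships; this sketch covers only the first, per-table step.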