Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models

📅 2025-06-04
🤖 AI Summary
This work addresses the Column Property Annotation (CPA) task for unlabeled tabular data: automatically identifying the semantic relationship between pairs of columns, using a Knowledge Graph (KG) as the reference vocabulary. The proposed framework guides a large language model (LLM) with statistical analysis. First, lightweight statistical constraints, derived from domain and range detection and from relation co-appearance analysis, prune the set of candidate KG relations. Second, the LLM selects among the remaining candidates. The evaluation measures the influence of each module and compares several LLMs at different quantization levels and with different prompting techniques. On two benchmark datasets from the SemTab challenge, the method is competitive with state-of-the-art approaches, and the code is publicly available on GitHub. The core contribution is the use of lightweight statistical constraints to narrow the relation search space before LLM inference.

📝 Abstract
Over the past few years, table interpretation tasks have made significant progress due to their importance and the introduction of new technologies and benchmarks in the field. This work experiments with a hybrid approach for detecting relationships among columns of unlabeled tabular data, using a Knowledge Graph (KG) as a reference point, a task known as CPA. This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The main modules of this approach for reducing the search space are domain and range constraint detection and relation co-appearance analysis. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs at various levels of quantization, as well as with different prompting techniques. The proposed methodology, which is publicly available on GitHub, proved to be competitive with state-of-the-art approaches on these datasets.
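The abstract does not spell out how relation co-appearance analysis works. The following Python sketch shows one plausible reading, not the authors' implementation: count how often pairs of KG relations appear on the same entity, then keep only candidates that co-appear with a relation already assigned elsewhere in the table. The function names, the `min_count` threshold, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(entity_relations):
    """Count how often each unordered pair of relations appears on the same entity."""
    counts = Counter()
    for relations in entity_relations.values():
        # Sort so each unordered pair always maps to the same tuple key.
        for a, b in combinations(sorted(set(relations)), 2):
            counts[(a, b)] += 1
    return counts


def filter_by_cooccurrence(candidates, known_relation, counts, min_count=1):
    """Keep candidates that co-appear with a relation already assigned in the table."""
    kept = []
    for candidate in candidates:
        key = tuple(sorted((candidate, known_relation)))
        if counts[key] >= min_count:
            kept.append(candidate)
    return kept


# Hypothetical toy KG: entity id -> relations observed on that entity.
ENTITY_RELATIONS = {
    "e1": ["birthPlace", "birthDate", "occupation"],
    "e2": ["birthPlace", "occupation"],
    "e3": ["capital", "population"],
}

counts = cooccurrence_counts(ENTITY_RELATIONS)
kept = filter_by_cooccurrence(
    ["occupation", "population", "birthDate"], "birthPlace", counts
)
```

In this toy example `population` is dropped because it never appears alongside `birthPlace`, so the LLM would only need to choose between the two remaining relations.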
Problem

Research questions and friction points this paper is trying to address.

Detecting relationships in unlabeled tabular data using hybrid methods
Reducing KG relation search space via statistical analysis and LLMs
Evaluating the influence of domain and range constraints and relation co-appearance on CPA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid approach combining LLMs and statistical analysis
Uses a Knowledge Graph as the reference for relationship detection
Reduces search space with domain and range constraints
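
As a rough illustration of how domain and range constraints can shrink the candidate set before an LLM is queried, the sketch below filters KG relations by the detected types of the subject and object columns. It is a minimal example under stated assumptions, not the paper's code: the `Relation` class, the `prune_by_domain_range` function, exact-match type comparison, and the toy relation list are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    domain: str  # expected type of the subject column
    range: str   # expected type of the object column


def prune_by_domain_range(relations, subject_type, object_type):
    """Keep only relations whose domain and range match the detected column types."""
    return [
        r for r in relations
        if r.domain == subject_type and r.range == object_type
    ]


# Hypothetical toy relation vocabulary.
RELATIONS = [
    Relation("birthPlace", "Person", "Place"),
    Relation("author", "Book", "Person"),
    Relation("capital", "Country", "Place"),
    Relation("deathPlace", "Person", "Place"),
]

# Suppose the subject column was detected as Person and the object column as Place.
candidates = prune_by_domain_range(RELATIONS, "Person", "Place")
candidate_names = [r.name for r in candidates]
```

Only the pruned list would then be placed in the LLM prompt. A real system would likely need type hierarchies and soft matching rather than the exact string equality assumed here.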