🤖 AI Summary
Existing approaches struggle to identify the semantic types of table columns efficiently and without privacy risk, because labeled data is scarce, the set of possible semantic types is large, and existing methods rely on proprietary large language models. To address these challenges, this work proposes ZTab, a domain-aware zero-shot column type recognition framework. ZTab uses a domain configuration to generate synthetic tables and fine-tunes an open-source large language model on them. It supports three zero-shot settings and enables accurate type inference on unseen tables from the same domain without user-provided annotations. The method remains broadly applicable while improving accuracy when the semantic type vocabulary is large, and it reduces dependence on closed-source models and the associated privacy risks.
📝 Abstract
This study addresses the challenge of automatically detecting semantic column types in relational tables, a key task in many real-world applications. Zero-shot modeling eliminates the need for user-provided labeled training data, making it ideal for scenarios where data collection is costly or restricted due to privacy concerns. However, existing zero-shot models suffer from poor performance when the number of semantic column types is large, limited understanding of tabular structure, and privacy risks arising from dependence on high-performance closed-source LLMs. We introduce ZTab, a domain-based zero-shot framework that addresses both performance and zero-shot requirements. Given a domain configuration consisting of a set of predefined semantic types and sample table schemas, ZTab generates pseudo-tables for the sample schemas and fine-tunes an annotation LLM on them. ZTab is domain-based zero-shot in that it does not depend on user-specific labeled training data; therefore, no retraining is needed for a test table from a similar domain. We describe three cases of domain-based zero-shot. The domain configuration of ZTab provides a trade-off between the extent of zero-shot and annotation performance: a "universal domain" that contains all semantic types approaches "pure" zero-shot, while a "specialized domain" that contains semantic types for a specific application enables better zero-shot performance within that domain. Source code and datasets are available at https://github.com/hoseinzadeehsan/ZTab
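As a rough illustration of the pipeline described in the abstract (domain configuration, then pseudo-tables, then fine-tuning data), the sketch below builds prompt/label pairs from a toy domain configuration. It is not taken from the ZTab repository: the names `DomainConfig`, `VALUE_POOLS`, `generate_pseudo_table`, `to_training_examples`, and `build_dataset` are hypothetical, the fixed value pools stand in for whatever generator ZTab actually uses to fill pseudo-tables, and the prompt wording is an assumption. The fine-tuning step itself is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class DomainConfig:
    """Hypothetical domain configuration: semantic types plus sample schemas."""
    semantic_types: list[str]
    schemas: list[list[str]]  # each schema is an ordered list of semantic types


# Assumption: toy value pools replace the real pseudo-table value generator.
VALUE_POOLS = {
    "city": ["Paris", "Lima", "Osaka", "Cairo"],
    "country": ["France", "Peru", "Japan", "Egypt"],
    "year": ["1998", "2005", "2014", "2021"],
}


def generate_pseudo_table(schema, n_rows, rng):
    """Return one pseudo-table as a list of (semantic_type, column_values) pairs."""
    return [(t, [rng.choice(VALUE_POOLS[t]) for _ in range(n_rows)]) for t in schema]


def to_training_examples(table, semantic_types):
    """Turn each column into a prompt/completion pair for an annotation model."""
    examples = []
    for label, values in table:
        prompt = (
            "Column values: " + ", ".join(values) + "\n"
            "Candidate types: " + ", ".join(semantic_types) + "\n"
            "Type:"
        )
        examples.append({"prompt": prompt, "completion": label})
    return examples


def build_dataset(config, tables_per_schema=2, n_rows=3, seed=0):
    """Generate pseudo-tables for every sample schema and flatten them to examples."""
    rng = random.Random(seed)
    dataset = []
    for schema in config.schemas:
        unknown = set(schema) - set(config.semantic_types)
        if unknown:
            raise ValueError(f"schema uses types outside the domain: {sorted(unknown)}")
        for _ in range(tables_per_schema):
            table = generate_pseudo_table(schema, n_rows, rng)
            dataset.extend(to_training_examples(table, config.semantic_types))
    return dataset


# A small "specialized domain" with three types and two sample schemas.
config = DomainConfig(
    semantic_types=["city", "country", "year"],
    schemas=[["city", "country"], ["country", "year"]],
)
dataset = build_dataset(config)
print(len(dataset))
print(dataset[0]["prompt"])
```

In this sketch, a "universal domain" would correspond to a configuration whose `semantic_types` list covers every known type, while a "specialized domain" keeps the list short; the labels come entirely from the configuration, so no user-labeled tables are involved.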