🤖 AI Summary
Problem: Enterprise relational databases contain many non-semantic columns (e.g., IP addresses, product codes, timestamps). Deciding automatically which of these columns are semantically equivalent is difficult because they carry no explicit schema annotations or domain knowledge. Method: This paper proposes a character-level autoencoder (CAE) constrained by a fixed-size dictionary, which learns pattern-aware representations of column values without supervision or prior semantic knowledge. The CAE is trained end-to-end to produce discriminative embeddings for column clustering, and the fixed dictionary keeps memory overhead low and the method scalable. Contribution/Results: Evaluated on real-world industrial datasets, the method achieves 80.95% Top-5 column matching accuracy, substantially outperforming a bag-of-words baseline (47.62%). To our knowledge, this is the first work to systematically apply a lightweight, character-level neural architecture to large-scale non-semantic column matching, thereby addressing a critical gap in industrial data management.
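As a rough illustration of what a fixed-size character dictionary can look like, the sketch below maps each cell value to a fixed-length sequence of character ids. The alphabet (printable ASCII), the padding and unknown ids, the 32-character cap, and the `encode_value` helper are all assumptions made for this example; the summary does not specify the dictionary the paper actually uses.

```python
import string
from typing import List

# Assumed fixed dictionary: digits, letters, punctuation, and space,
# plus two reserved ids for padding and unknown characters.
PAD, UNK = 0, 1
CHARS = string.digits + string.ascii_letters + string.punctuation + " "
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(CHARS)}
VOCAB_SIZE = len(CHARS) + 2  # stays constant no matter how much data is seen
MAX_LEN = 32                 # assumed fixed input length per cell value


def encode_value(value: str, max_len: int = MAX_LEN) -> List[int]:
    """Map one cell value to a fixed-length list of character ids."""
    # Truncate long values; unseen characters share the single UNK id.
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in value[:max_len]]
    # Right-pad short values so every encoded value has the same length.
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    for sample in ["192.168.0.1", "SKU-00417-A", "2024-03-01T12:00:00Z"]:
        print(sample, encode_value(sample)[:12])
```

Because every value is expressed over the same small alphabet, the dictionary does not grow with the data, and characters outside the alphabet fall back to one unknown id instead of creating new entries. An autoencoder would consume these id sequences as input; that part is omitted here.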
📝 Abstract
Enterprise relational databases increasingly contain vast amounts of non-semantic data (IP addresses, product identifiers, encoded keys, and timestamps) that challenge traditional semantic analysis. This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets by detecting similarities in their data patterns and structures. Unlike conventional Natural Language Processing (NLP) models, which struggle with limited semantic interpretability and out-of-vocabulary tokens, our approach operates at the character level under a fixed dictionary constraint. The CAE architecture encodes text representations of non-semantic relational table columns and extracts high-dimensional feature embeddings for data grouping. Because the dictionary size is fixed, the method significantly reduces both memory requirements and training time, enabling scalable processing of large data lakes, warehouses, and other industrial data environments. In experimental evaluation, the CAE approach achieved 80.95% accuracy in top-5 column matching across relational datasets, substantially outperforming traditional NLP approaches such as Bag of Words (47.62%), which demonstrates its effectiveness for identifying and clustering identical columns. This work bridges the gap between theoretical advances in character-level neural architectures and practical enterprise data management challenges, providing an automated solution for schema understanding and data profiling of non-semantic industrial datasets at scale.
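To make the top-5 matching metric concrete, here is a minimal sketch of how top-k accuracy could be computed once each column has an embedding. Cosine similarity as the ranking score, the helper names `top_k_matches` and `top_k_accuracy`, and the toy three-dimensional vectors are all assumptions for illustration; the abstract does not state how matches are scored, and real embeddings would come from the trained encoder.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_matches(query: Sequence[float],
                  candidates: Dict[str, Sequence[float]],
                  k: int = 5) -> List[str]:
    """Return the k candidate column names most similar to the query."""
    ranked = sorted(candidates,
                    key=lambda name: cosine(query, candidates[name]),
                    reverse=True)
    return ranked[:k]


def top_k_accuracy(queries: Dict[str, Sequence[float]],
                   candidates: Dict[str, Sequence[float]],
                   truth: Dict[str, str],
                   k: int = 5) -> float:
    """Fraction of query columns whose true match appears in the top k."""
    hits = sum(truth[name] in top_k_matches(vec, candidates, k)
               for name, vec in queries.items())
    return hits / len(queries)


# Toy, hand-written embeddings (hypothetical column names and values).
candidates = {"ip_addr": [1.0, 0.0, 0.0],
              "sku": [0.0, 1.0, 0.0],
              "created_at": [0.0, 0.0, 1.0]}
queries = {"client_ip": [0.9, 0.1, 0.0],
           "product_code": [0.2, 0.8, 0.1]}
truth = {"client_ip": "ip_addr", "product_code": "sku"}

if __name__ == "__main__":
    print(top_k_accuracy(queries, candidates, truth, k=1))
```

A column counts as correctly matched when its true counterpart appears anywhere among the k highest-ranked candidates, so top-5 accuracy is more forgiving than requiring the single best match to be correct.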