CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Enterprise relational databases contain numerous non-semantic columns (e.g., IP addresses, product codes, timestamps) whose semantic equivalence is difficult to identify automatically due to the absence of explicit schema annotations or domain knowledge. Method: This paper proposes a character-level autoencoder (CAE) constrained by a fixed-size dictionary, enabling unsupervised learning of pattern-aware column-value representations without prior semantic knowledge. The CAE is trained end-to-end to produce discriminative embeddings for efficient column clustering, while its fixed dictionary ensures scalability and low memory overhead. Contribution/Results: Evaluated on real-world industrial datasets, the method achieves 80.95% top-5 column matching accuracy, substantially outperforming a bag-of-words baseline (47.62%). To our knowledge, this is the first work to systematically apply a lightweight, character-level neural architecture to large-scale non-semantic column matching, addressing a critical gap in industrial data management.

📝 Abstract
Enterprise relational databases increasingly contain vast amounts of non-semantic data (IP addresses, product identifiers, encoded keys, and timestamps) that challenge traditional semantic analysis. This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets by detecting column similarities based on data patterns and structures. Unlike conventional Natural Language Processing (NLP) models, which suffer from limited semantic interpretability and out-of-vocabulary tokens, our approach operates at the character level with fixed dictionary constraints, enabling scalable processing of large data lakes and warehouses. The CAE architecture encodes text representations of non-semantic relational table columns and extracts high-dimensional feature embeddings for data grouping. By maintaining a fixed dictionary size, our method significantly reduces both memory requirements and training time, enabling efficient processing of large-scale industrial data environments. Experimental evaluation demonstrates substantial performance gains: our CAE approach achieved 80.95% accuracy on top-5 column matching tasks across relational datasets, substantially outperforming traditional NLP approaches such as Bag of Words (47.62%). These results demonstrate its effectiveness for identifying and clustering identical columns in relational datasets. This work bridges the gap between theoretical advances in character-level neural architectures and practical enterprise data management challenges, providing an automated solution for schema understanding and data profiling of non-semantic industrial datasets at scale.
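The top-5 column matching task described in the abstract amounts to a nearest-neighbor lookup over column embeddings. A minimal sketch of that lookup, with hypothetical embeddings and cosine similarity (the paper does not specify its exact similarity function, so this is illustrative only):

```python
import numpy as np

def top_k_matches(query, embeddings, k=5):
    """Return indices of the k column embeddings most cosine-similar
    to the query embedding."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity to each column
    return np.argsort(-sims)[:k]     # indices of the k best matches

# Toy example: 6 column embeddings in a 4-D space (random, hypothetical).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))
query = embeddings[2] + 0.01 * rng.normal(size=4)  # near column 2
print(top_k_matches(query, embeddings, k=5))
```

A query column counts as correctly matched if its true counterpart appears anywhere in the returned top-5 list; accuracy is the fraction of queries for which that holds.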
Problem

Research questions and friction points this paper is trying to address.

Automatically groups semantically identical columns in non-semantic relational datasets
Overcomes limitations of NLP models for non-semantic data like IP addresses
Enables scalable processing of large industrial data lakes with reduced memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Character-level autoencoder for non-semantic data grouping
Fixed dictionary constraints enable scalable processing
Encodes text representations to extract feature embeddings
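The fixed-dictionary constraint listed above can be illustrated with a simple character encoder: every character outside a fixed vocabulary maps to a single out-of-vocabulary slot, so input dimensionality (and hence model size) is bounded regardless of the data. A minimal sketch; the vocabulary, names, and sizes here are illustrative assumptions, not the paper's implementation:

```python
import string
import numpy as np

# Fixed character dictionary: lowercase letters, digits, a few symbols
# common in non-semantic values, plus one catch-all OOV slot at index 0.
VOCAB = ["<oov>"] + list(string.ascii_lowercase + string.digits + ".-_:/")
CHAR2ID = {c: i for i, c in enumerate(VOCAB)}

def encode(value, max_len=32):
    """One-hot encode a column value at the character level.
    Output shape is (max_len, len(VOCAB)) for every input, so memory
    use is fixed by the dictionary size, not by the data."""
    out = np.zeros((max_len, len(VOCAB)), dtype=np.float32)
    for i, ch in enumerate(value.lower()[:max_len]):
        out[i, CHAR2ID.get(ch, 0)] = 1.0  # unknown chars -> OOV slot
    return out

x = encode("192.168.0.1")
print(x.shape)  # fixed-size input suitable for the autoencoder
```

An autoencoder trained on such fixed-size inputs never grows its input layer when new tokens appear, which is what keeps memory and training time bounded on large data lakes.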
V. V. S. B. Nunna
Amazon Web Services Inc., Arlington, VA, USA
Shinae Kang
Amazon Web Services Inc., Arlington, VA, USA
Zheyuan Zhou
Zhejiang University
Virginia Wang
Amazon Web Services Inc., Seattle, WA, USA
Sucharitha Boinapally
Amazon Web Services Inc., Dallas, TX, USA
Michael Foley
Amazon Web Services Inc., Arlington, VA, USA