CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping

📅 2025-11-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Enterprise relational databases contain many non-semantic columns (e.g., IP addresses, product codes, timestamps). Without explicit schema annotations or domain knowledge, it is hard to tell automatically which of these columns hold the same kind of data. Method: This paper proposes a character-level autoencoder (CAE) constrained by a fixed-size dictionary, which learns pattern-aware representations of column values without supervision or prior semantic knowledge. The CAE is trained end-to-end to produce embeddings for column clustering, and the fixed dictionary keeps memory overhead low and the approach scalable. Contribution/Results: On real-world industrial datasets, the method reaches 80.95% Top-5 column matching accuracy, compared with 47.62% for a bag-of-words baseline. The authors describe it, to their knowledge, as the first work to systematically apply a lightweight character-level neural architecture to large-scale non-semantic column matching.
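
The Top-5 metric quoted above can be sketched as follows. This is a minimal illustration, assuming columns are compared by cosine similarity between embedding vectors; the summary does not state which similarity measure the paper uses. The column names and 3-dimensional vectors are made up for the example and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_matches(query, candidates, k=5):
    """Names of the k candidate columns most similar to the query embedding."""
    ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]),
                    reverse=True)
    return ranked[:k]

def top_k_accuracy(queries, candidates, truth, k=5):
    """Fraction of query columns whose true match appears among the top k."""
    hits = sum(truth[name] in top_k_matches(vec, candidates, k)
               for name, vec in queries.items())
    return hits / len(queries)

# Hypothetical embeddings; a trained CAE would produce these from column values.
candidates = {
    "src_ip": [0.9, 0.1, 0.0],
    "order_code": [0.1, 0.9, 0.1],
    "created_at": [0.0, 0.2, 0.9],
}
queries = {"client_addr": [0.8, 0.2, 0.1], "sku": [0.2, 0.8, 0.0]}
truth = {"client_addr": "src_ip", "sku": "order_code"}

accuracy = top_k_accuracy(queries, candidates, truth, k=1)
```

With only three candidate columns the toy example uses `k=1`; the reported figures correspond to `k=5` over a larger candidate set.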

📝 Abstract

Enterprise relational databases increasingly contain vast amounts of non-semantic data (IP addresses, product identifiers, encoded keys, and timestamps) that challenge traditional semantic analysis. This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets by detecting column similarities based on data patterns and structures. Unlike conventional Natural Language Processing (NLP) models, which struggle with limited semantic interpretability and out-of-vocabulary tokens, our approach operates at the character level with fixed dictionary constraints, enabling scalable processing of large-scale data lakes and warehouses. The CAE architecture encodes text representations of non-semantic relational table columns and extracts high-dimensional feature embeddings for data grouping. By maintaining a fixed dictionary size, our method significantly reduces both memory requirements and training time, enabling efficient processing of large-scale industrial data environments. Experimental evaluation demonstrates substantial performance gains: our CAE approach achieves 80.95% accuracy in top-5 column matching tasks across relational datasets, outperforming traditional NLP approaches such as Bag of Words (47.62%). These results demonstrate its effectiveness for identifying and clustering identical columns in relational datasets. This work bridges the gap between theoretical advances in character-level neural architectures and practical enterprise data management challenges, providing an automated solution for schema understanding and data profiling of non-semantic industrial datasets at scale.
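
To make the "fixed dictionary" idea concrete, here is a small sketch of character-level encoding. The alphabet, the `PAD`/`UNK` indices, and the `max_len` of 16 are assumptions chosen for illustration; the abstract does not list the paper's actual dictionary or sequence length.

```python
import string

# Hypothetical fixed dictionary: index 0 pads short values, and index 1
# absorbs every character outside the alphabet, so nothing is out of vocabulary.
PAD, UNK = 0, 1
CHARS = string.digits + string.ascii_lowercase + ".:-_/ "
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(CHARS)}
VOCAB_SIZE = len(CHARS) + 2  # stays constant however many rows are processed

def encode(value, max_len=16):
    """Turn one cell value into a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in value.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

ip_ids = encode("192.168.0.1")
code_ids = encode("SKU-00417_B")
odd_ids = encode("€42")  # the euro sign is not in CHARS and maps to UNK
```

Because `VOCAB_SIZE` does not grow with the data, any embedding table or one-hot layer built on these ids has a fixed size, which is the property the abstract credits for the lower memory use.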

Problem

Research questions and friction points this paper is trying to address.

Automatically group semantically identical columns in non-semantic relational datasets

Overcome the limitations of NLP models on non-semantic data such as IP addresses

Enable scalable processing of large industrial data lakes with reduced memory use
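
The following toy comparison hints at why word-level models have trouble with values such as IP addresses. It is only an illustration using Jaccard overlap on made-up data, not the bag-of-words baseline or the CAE evaluated in the paper.

```python
def jaccard(a, b):
    """Share of distinct items that two collections have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def word_tokens(values):
    # Whitespace tokenization, as a word-level bag-of-words model would do:
    # each IP address becomes one opaque token.
    return [tok for value in values for tok in value.split()]

def char_tokens(values):
    # Character-level view: every value decomposes into a small shared alphabet.
    return [ch for value in values for ch in value]

# Two made-up IP-address columns that share no literal value.
col_a = ["10.0.0.1", "10.0.0.27"]
col_b = ["192.168.1.5", "172.16.4.9"]

word_overlap = jaccard(word_tokens(col_a), word_tokens(col_b))
char_overlap = jaccard(char_tokens(col_a), char_tokens(col_b))
```

Here `word_overlap` is 0.0 because no address appears in both columns, while `char_overlap` is 0.4 because both columns draw on digits and dots. A learned character-level model can exploit far richer structure than this set overlap, such as position and ordering.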

Innovation

Methods, ideas, or system contributions that make the work stand out.

Character-level autoencoder for non-semantic data grouping

Fixed dictionary constraints enable scalable processing

Encodes text representations to extract feature embeddings
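
The encode, compress, and reconstruct idea behind an autoencoder can be sketched in plain Python. This is a deliberately simplified stand-in: a linear autoencoder over per-column character-frequency vectors, trained with stochastic gradient descent. It is not the paper's architecture, which this summary does not specify, and the alphabet, bottleneck size `D`, learning rate, and sample columns are all assumptions.

```python
import random

ALPHABET = "0123456789abcdef.:-"  # hypothetical fixed dictionary
V = len(ALPHABET) + 1              # final slot collects unknown characters
D = 3                              # bottleneck (embedding) size, chosen arbitrarily

def profile(values):
    """Normalized character-frequency vector for one column."""
    counts = [0.0] * V
    for value in values:
        for ch in value.lower():
            i = ALPHABET.find(ch)
            counts[i if i >= 0 else V - 1] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def train(data, epochs=500, lr=0.2, seed=0):
    """Fit encoder/decoder weights by minimizing squared reconstruction error."""
    rng = random.Random(seed)
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(D)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(V)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            z = matvec(enc, x)        # embedding
            x_hat = matvec(dec, z)    # reconstruction
            total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
            g = [2 * (a - b) for a, b in zip(x_hat, x)]  # dLoss/dx_hat
            # Backpropagate to the bottleneck before touching the decoder.
            dz = [sum(g[i] * dec[i][j] for i in range(V)) for j in range(D)]
            for i in range(V):
                for j in range(D):
                    dec[i][j] -= lr * g[i] * z[j]
            for j in range(D):
                for k in range(V):
                    enc[j][k] -= lr * dz[j] * x[k]
        losses.append(total / len(data))
    return enc, dec, losses

# Made-up sample columns: IP addresses, hex keys, and timestamps.
columns = [
    ["10.0.0.1", "192.168.1.5"],
    ["a3f9-00bc", "ffe0-12ad"],
    ["2025-11-10 08:15:00", "2024-01-02 23:59:59"],
]
data = [profile(col) for col in columns]
enc, dec, losses = train(data)
embedding = matvec(enc, data[0])  # D-dimensional vector for the first column
```

After training, `matvec(enc, profile(column))` gives a compact vector per column that could feed a similarity search or clustering step. A sequence-aware model would also capture character order, which frequency vectors discard.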

🔎 Similar Papers

AMR-RE: Abstract Meaning Representations for Retrieval-Based In-Context Learning in Relation Extraction