SAVeD: Semantic Aware Version Discovery

📅 2025-11-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In data science, structural transformations applied to datasets obscure version lineage, leading to redundant effort in data curation and integration. To address this, the authors propose an unsupervised, semantic-aware version discovery framework that infers version relationships without relying on metadata, labels, or integration-based assumptions. The method uses contrastive learning to model semantic similarity between tables. Specifically, it adapts the SimCLR framework with row-dropping and encoding-perturbation augmentations, and embeds the resulting table views with a custom Transformer-based encoder. Experiments across five benchmark datasets show substantial gains in version discrimination after training: the approach achieves higher classification accuracy and separation scores than untrained baselines, performs competitively with or better than prior dataset-discovery methods such as Starmie, and generalizes to unseen tables.
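The two augmentations named in the summary can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table is taken to be already encoded as rows of floats, "encoding perturbation" is interpreted as small Gaussian noise on those encodings, and the function names and default rates are hypothetical rather than taken from the paper.

```python
import random

def drop_rows(table, drop_prob, rng):
    """Return a copy of `table` with each row dropped independently with probability `drop_prob`."""
    kept = [row for row in table if rng.random() >= drop_prob]
    # Avoid an empty view: fall back to a single randomly chosen row.
    return kept if kept else [rng.choice(table)]

def perturb_encoding(encoded, noise_scale, rng):
    """Add Gaussian noise to every cell encoding (an assumed reading of 'encoding perturbation')."""
    return [[x + rng.gauss(0.0, noise_scale) for x in row] for row in encoded]

def make_views(encoded_table, rng, drop_prob=0.2, noise_scale=0.01):
    """Build two independently augmented views of one table, as in a SimCLR-style pipeline."""
    views = []
    for _ in range(2):
        view = drop_rows(encoded_table, drop_prob, rng)
        views.append(perturb_encoding(view, noise_scale, rng))
    return views[0], views[1]

# Hypothetical encoded table: 5 rows, 3 numeric columns.
table = [[float(r + c) for c in range(3)] for r in range(5)]
view_a, view_b = make_views(table, random.Random(0))
```

Both views come from the same table, so a contrastive objective would treat them as a positive pair, while views of other tables in the batch act as negatives.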

📝 Abstract

Our work introduces SAVeD (Semantically Aware Version Detection), a contrastive learning-based framework for identifying versions of structured datasets without relying on metadata, labels, or integration-based assumptions. SAVeD addresses a common challenge in data science: repeated labor caused by the difficulty of identifying similar work or transformations on datasets. SAVeD employs a modified SimCLR pipeline, generating augmented table views through random transformations (e.g., row deletion, encoding perturbations). These views are embedded via a custom transformer encoder and contrasted in latent space to optimize semantic similarity. Our model learns to minimize distances between augmented views of the same dataset and maximize those between unrelated tables. We evaluate performance using validation accuracy and separation. Validation accuracy is the proportion of correctly classified version/non-version pairs on a hold-out set. Separation is the difference between the average similarities of versioned and non-versioned tables, where versioning is defined by a benchmark and not provided to the model. Our experiments span five canonical datasets from the Semantic Versioning in Databases Benchmark and demonstrate substantial gains after training. SAVeD achieves significantly higher accuracy on completely unseen tables and a significant boost in separation scores, confirming its capability to distinguish semantically altered versions. Compared to untrained baselines and prior state-of-the-art dataset-discovery methods like Starmie, our custom encoder achieves competitive or superior results.
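The two evaluation metrics defined in the abstract can be written down as a short sketch. The similarity function (cosine) and the decision threshold used to turn a similarity into a version/non-version prediction are assumptions; the abstract does not state either, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def separation(similarities, is_version):
    """Mean similarity of versioned pairs minus mean similarity of non-versioned pairs."""
    pos = [s for s, y in zip(similarities, is_version) if y]
    neg = [s for s, y in zip(similarities, is_version) if not y]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def pair_accuracy(similarities, is_version, threshold=0.5):
    """Fraction of pairs whose thresholded similarity matches the benchmark label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(similarities, is_version))
    return correct / len(similarities)

# Hypothetical pairwise similarities with benchmark labels (True = versions of each other).
sims = [0.9, 0.8, 0.1, 0.3]
labels = [True, True, False, False]
sep = separation(sims, labels)
acc = pair_accuracy(sims, labels)
```

The labels are used only for scoring, which matches the abstract's statement that version relationships are defined by the benchmark and never shown to the model.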

Problem

Research questions and friction points this paper is trying to address.

Identifying dataset versions without metadata or labels

Reducing repeated labor from similar dataset transformations

Distinguishing semantically altered versions of structured datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses contrastive learning for dataset version detection

Applies random transformations to create augmented table views

Employs custom transformer encoder for semantic similarity optimization
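The contrastive objective behind these points can be sketched with the NT-Xent loss used by the original SimCLR framework. This is an illustration under assumptions: the summary says SAVeD modifies SimCLR but does not give its exact loss or temperature, so the temperature value and the function names below are hypothetical.

```python
import math

def _normalize(v):
    """Scale a vector to unit length (assumes a nonzero embedding)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nt_xent_loss(views_a, views_b, temperature=0.5):
    """NT-Xent loss: views_a[i] and views_b[i] embed two augmented views of table i."""
    z = [_normalize(v) for v in views_a + views_b]
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same table
        pos_logit = _dot(z[i], z[pos]) / temperature
        # Logits against every other embedding in the batch (self excluded).
        logits = [_dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)

# Hypothetical 2-D embeddings for two tables, two views each.
a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]     # each view close to its own table
misaligned = [[0.1, 1.0], [1.0, 0.1]]  # each view close to the other table
loss_aligned = nt_xent_loss(a, aligned)
loss_misaligned = nt_xent_loss(a, misaligned)
```

Minimizing this loss pulls the two views of the same table together and pushes views of different tables apart, which is the behavior the abstract describes for the trained encoder.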

🔎 Similar Papers

OM4OV: Leveraging Ontology Matching for Ontology Versioning