Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild

📅 2025-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Detecting synthetic tabular data in real-world scenarios remains challenging due to unknown table structures and poor cross-table generalization.

Method: This paper proposes a datum-wise Transformer architecture that models each row independently, eliminating assumptions about column count or type. It incorporates column-aware embedding, dynamic masked attention, and adversarial domain alignment to enable zero-shot structural transfer and robust detection across heterogeneous distributions.

Contribution/Results: This work is the first to introduce the datum-wise paradigm to synthetic tabular data detection, supporting arbitrary, previously unseen table schemas. Evaluated on a multi-source, heterogeneous benchmark, the method achieves an 8.2% AUC improvement over state-of-the-art approaches. The framework delivers scalable, highly generalizable synthetic-data attribution, which is critical for industrial and governmental applications requiring reliable provenance tracing of tabular content.
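To make the idea of modelling each row independently of the table schema more concrete, here is a minimal Python sketch. It is not the paper's implementation: the token format, the `row_to_tokens` and `pad_batch` helpers, and the `[PAD]` symbol are all assumptions chosen for illustration. It only shows how rows from tables with different column counts and types could be turned into variable-length sequences with a padding mask, which is the kind of input a masked-attention Transformer can consume.

```python
from typing import Any, Dict, List, Tuple

PAD = "[PAD]"  # hypothetical padding symbol, not taken from the paper


def row_to_tokens(row: Dict[str, Any]) -> List[str]:
    """Serialize one row into a flat token list without assuming a fixed schema.

    Each token carries the column name, a coarse type tag, and the value
    (an illustrative encoding, not the paper's).
    """
    tokens: List[str] = []
    for column, value in row.items():
        # bool is a subclass of int in Python, so exclude it from "num" explicitly.
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        kind = "num" if is_numeric else "cat"
        tokens.append(f"{column}|{kind}|{value}")
    return tokens


def pad_batch(rows: List[Dict[str, Any]]) -> Tuple[List[List[str]], List[List[bool]]]:
    """Pad rows of different widths to a common length.

    Returns the padded token sequences and a mask where True marks a real
    token and False marks padding.
    """
    sequences = [row_to_tokens(r) for r in rows]
    max_len = max((len(s) for s in sequences), default=0)
    padded = [s + [PAD] * (max_len - len(s)) for s in sequences]
    mask = [[True] * len(s) + [False] * (max_len - len(s)) for s in sequences]
    return padded, mask


# Two rows drawn from tables with different schemas.
rows = [
    {"age": 31, "city": "Paris"},
    {"price": 9.5, "sku": "A1", "in_stock": True},
]
padded, mask = pad_batch(rows)
```

In this sketch the mask uses True for real tokens. Some libraries use the opposite convention (for example, PyTorch's `src_key_padding_mask` marks positions to ignore with True), so the mask would need to be inverted before being passed to such an encoder.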

📝 Abstract

The growing power of generative models raises major concerns about the authenticity of published content. To address this problem, several synthetic content detection methods have been proposed for uniformly structured media such as images or text. However, little work has been done on the detection of synthetic tabular data, despite its importance in industry and government. This form of data is complex to handle due to the diversity of its structures: the number and types of the columns may vary wildly from one table to another. We tackle the tough problem of detecting synthetic tabular data "in the wild", i.e., when the model is deployed on table structures it has never seen before. We introduce a novel datum-wise transformer architecture and show that it outperforms existing models. Furthermore, we investigate the application of domain adaptation techniques to enhance the effectiveness of our model, thereby providing a more robust data-forgery detection solution.

Problem

Research questions and friction points this paper is trying to address.

Detecting synthetic tabular data in diverse real-world structures

Addressing lack of methods for non-uniform tabular data detection

Improving robustness with domain adaptation for data-forgery detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Datum-wise transformer for diverse tabular data

Domain adaptation enhances detection robustness

Outperforms existing synthetic data detection models

🔎 Similar Papers

Fine-tuned In-Context Learning Transformers are Excellent Tabular Data Classifiers