Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a long-standing challenge in privacy-sensitive data research: the scarcity of real-world datasets that can adequately support AI systems in handling sensitive personal information. To overcome this limitation, we introduce Privasis, the first large-scale, fully synthetic privacy dataset generated from scratch, spanning domains such as healthcare, legal, and finance and containing 55.1 million annotated attributes. Using a pipeline that decomposes texts and applies targeted de-identification, combined with synthetic data generation techniques, we train lightweight de-identification models with at most 4 billion parameters. Experimental results show that our models outperform state-of-the-art large language models, including GPT-5 and Qwen-3 235B, on text de-identification tasks. The dataset, models, and code will be publicly released to advance AI research in privacy-sensitive domains.

📝 Abstract
Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset built entirely from scratch--an expansive reservoir of texts with rich and diverse private information--designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with high quality and far greater diversity across document types, including medical histories, legal documents, financial records, calendars, and text messages, with a total of 55.1 million annotated attributes such as ethnicity, date of birth, and workplace. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B parameters) trained on this dataset outperform state-of-the-art large language models such as GPT-5 and Qwen-3 235B. We plan to release the data, models, and code to accelerate future research on privacy-sensitive domains and agents.
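The abstract describes a decompose-then-sanitize pipeline but gives no implementation details. The sketch below is purely illustrative and is not the paper's method: it stands in for the trained sanitization model with a few hypothetical regex patterns (`PATTERNS`, `decompose`, `sanitize` are all names invented here), showing only the overall shape of splitting a document into segments and replacing detected sensitive spans with typed placeholders.

```python
import re

# Hypothetical attribute detectors standing in for the paper's trained
# (<=4B-parameter) sanitization model; real coverage is far broader.
PATTERNS = {
    "DATE_OF_BIRTH": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def decompose(text: str) -> list[str]:
    """Split a document into sentence-level segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sanitize_segment(segment: str) -> str:
    """Replace each detected sensitive span with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        segment = pattern.sub(f"[{label}]", segment)
    return segment

def sanitize(text: str) -> str:
    """Decompose the text, sanitize each segment, then reassemble."""
    return " ".join(sanitize_segment(s) for s in decompose(text))
```

For example, `sanitize("Born 1990-05-02. Call 555-123-4567.")` yields `"Born [DATE_OF_BIRTH]. Call [PHONE]."`. Working segment by segment mirrors the targeted, per-span sanitization the abstract describes, as opposed to rewriting whole documents at once.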
Problem

Research questions and friction points this paper is trying to address.

privacy-sensitive data
data scarcity
synthetic dataset
AI agents
personal information
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic dataset
privacy-preserving AI
text sanitization
data scaling
sensitive information