Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of preserving personally identifiable information (PII) privacy when continually pretraining large language models on small-scale, domain-specific corpora. The authors propose a synthetic data generation framework guided by weighted entity graphs, integrating deterministic encryption with an authorized decryption mechanism to enable secure model updates and instruction-following capabilities without compromising PII confidentiality. This work presents the first exploration of continual pretraining on encrypted synthetic data, leveraging graph-based modeling of entity relationships to enhance data utility. Experimental results demonstrate that, under data-scarce conditions, models pretrained on encrypted data significantly outperform baseline approaches while guaranteeing PII security, albeit with a slight performance gap compared to unencrypted training. Further improvements are achieved by increasing the number of entities and incorporating graph-guided synthesis.
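The deterministic-encryption step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it pseudonymizes each PII entity with an HMAC-derived token, so identical entities always map to the same ciphertext token (the consistency property deterministic encryption provides for training data), while the authorized party holding the key keeps the token-to-entity table needed for decryption. The class, token format, and key handling are all assumptions for illustration.

```python
import hmac
import hashlib

class DeterministicPIIEncryptor:
    """Toy deterministic pseudonymizer: same entity + same key -> same token.

    NOTE: illustrative only. A production system would use a proper
    deterministic encryption scheme (e.g., AES-SIV), not a keyed hash
    plus a lookup table.
    """

    def __init__(self, key: bytes):
        self._key = key
        self._reverse = {}  # token -> plaintext, held only by the authorized party

    def encrypt(self, entity: str) -> str:
        digest = hmac.new(self._key, entity.encode(), hashlib.sha256).hexdigest()[:16]
        token = f"<PII_{digest}>"
        self._reverse[token] = entity
        return token

    def decrypt(self, token: str) -> str:
        # Authorized decryption: only a holder of the key/table can reverse tokens.
        return self._reverse[token]

enc = DeterministicPIIEncryptor(key=b"secret-key")
t1 = enc.encrypt("Alice Smith")
t2 = enc.encrypt("Alice Smith")
assert t1 == t2                          # determinism: entity stays consistent corpus-wide
assert enc.decrypt(t1) == "Alice Smith"  # authorized access recovers the original PII
```

Determinism is what lets the pretrained model learn stable facts about an encrypted entity across documents; the paper discusses the security trade-offs this property introduces.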

📝 Abstract
Preserving privacy in sensitive data while pretraining large language models on small, domain-specific corpora presents a significant challenge. In this work, we take an exploratory step toward privacy-preserving continual pretraining by proposing an entity-based framework that synthesizes encrypted training data to protect personally identifiable information (PII). Our approach constructs a weighted entity graph to guide data synthesis and applies deterministic encryption to PII entities, enabling LLMs to encode new knowledge through continual pretraining while granting authorized access to sensitive data through decryption keys. Our results on limited-scale datasets demonstrate that our pretrained models outperform base models and ensure PII security, while exhibiting a modest performance gap compared to models trained on unencrypted synthetic data. We further show that increasing the number of entities and leveraging graph-based synthesis improves model performance, and that encrypted models retain instruction-following capabilities with long retrieved contexts. We discuss the security implications and limitations of deterministic encryption, positioning this work as an initial investigation into the design space of encrypted data pretraining for privacy-preserving LLMs. Our code is available at https://github.com/DataArcTech/SoE.
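The abstract's weighted entity graph could, under assumptions (the paper's exact construction is not shown here), be built from entity co-occurrence counts and then sampled to pick related entity pairs that seed synthetic text. The function names, the co-occurrence weighting, and the template-based synthesis step are all hypothetical; in the paper the sampled entities would guide an LLM-based generator rather than a fixed template.

```python
import random
from collections import Counter
from itertools import combinations

def build_entity_graph(documents):
    """Weight each entity pair by how often the two entities co-occur in a document."""
    weights = Counter()
    for entities in documents:
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights

def sample_pair(weights, rng=random):
    """Sample an entity pair with probability proportional to its edge weight."""
    pairs = list(weights)
    return rng.choices(pairs, weights=[weights[p] for p in pairs], k=1)[0]

# Hypothetical per-document entity lists extracted from a domain corpus.
docs = [
    ["Alice", "Acme Corp", "Berlin"],
    ["Alice", "Acme Corp"],
    ["Bob", "Berlin"],
]
graph = build_entity_graph(docs)
a, b = sample_pair(graph, rng=random.Random(0))
# Feed the sampled pair into a template or an LLM prompt for synthesis:
synthetic = f"{a} is associated with {b} in the source records."
```

Weighting edges by co-occurrence biases synthesis toward entity relationships that are actually attested in the corpus, which is one plausible reading of how graph guidance improves data utility.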
Problem

Research questions and friction points this paper is trying to address.

privacy-preserving
continual pretraining
encrypted synthetic data
personally identifiable information
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

continual pretraining
encrypted synthetic data
privacy-preserving LLMs
entity-based synthesis
deterministic encryption