Enhancing Leakage Attacks on Searchable Symmetric Encryption Using LLM-Based Synthetic Data Generation

📅 2025-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Leakage attacks on Searchable Symmetric Encryption (SSE) are severely constrained by the scarcity of real-world leaked data under practical threat scenarios. Method: This paper proposes a large language model (LLM)-based synthetic data augmentation framework built on GPT-4 variants. The framework aims for both semantic and statistical similarity to the real corpus and combines random sampling with hierarchical clustering to generate high-fidelity synthetic documents, specifically to strengthen the SAP (Search Access Pattern) keyword inference attack restricted to token volumes only. Contribution/Results: To our knowledge, this is the first work to incorporate LLMs into SSE attack modeling, thereby relaxing the strong dependence on large-scale real leakage data and establishing a more realistic threat model. Experiments on the Enron email corpus demonstrate that, using only 1% of real leakage data, our method achieves over 85% keyword identification accuracy, matching or approaching the performance of baselines that use several times more real data and significantly lowering the data requirement for effective SAP attacks.

📝 Abstract

Searchable Symmetric Encryption (SSE) enables efficient search over encrypted data, allowing users to maintain privacy while using cloud storage. However, SSE schemes are vulnerable to leakage attacks that exploit access patterns, search frequency, and volume information. Existing studies frequently assume that adversaries possess a substantial fraction of the encrypted dataset in order to mount effective inference attacks. This presupposes a large-scale leak of the underlying documents, an assumption that may not hold in real-world scenarios. In this work, we investigate the feasibility of enhancing leakage attacks under a more realistic threat model in which adversaries have access to minimal leaked data. We propose a novel approach that leverages large language models (LLMs), specifically GPT-4 variants, to generate synthetic documents that statistically and semantically resemble the real-world dataset of Enron emails. Using the Enron email corpus as a case study, we evaluate how synthetic data generated via random sampling and hierarchical clustering affects the performance of the SAP (Search Access Pattern) keyword inference attack restricted to token volumes only. Our results demonstrate that, while the choice of LLM has limited effect, increasing dataset size and employing clustering-based generation significantly improve attack accuracy, achieving performance comparable to attacks that use larger amounts of real data. We highlight the growing relevance of LLMs in adversarial contexts.
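
To make the role of token volumes concrete, the following is a minimal sketch of a volume-only keyword guess. It is not the SAP attack evaluated in the paper: it is a simplified stand-in that assumes the adversary estimates each keyword's volume (the fraction of documents containing it) from an auxiliary corpus, such as leaked plus synthetic documents, and then pairs observed query tokens with keywords by rank of volume. The function names, the whitespace tokenization, and the rank-matching rule are all assumptions made for illustration.

```python
from collections import Counter


def keyword_volumes(docs, keywords):
    """Estimate each keyword's volume as the fraction of docs containing it.

    Tokenization by lowercase whitespace split is an assumption of this sketch.
    """
    counts = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        for kw in keywords:
            if kw in words:
                counts[kw] += 1
    return {kw: counts[kw] / len(docs) for kw in keywords}


def match_tokens_by_volume(token_volumes, aux_volumes):
    """Pair tokens with keywords in sorted order of volume.

    For equally sized sets, this rank matching minimizes the total squared
    difference between observed and estimated volumes. It is a simplified
    stand-in, not the likelihood-based matching of the SAP attack.
    """
    tokens = sorted(token_volumes, key=lambda t: (token_volumes[t], t))
    kws = sorted(aux_volumes, key=lambda k: (aux_volumes[k], k))
    return dict(zip(tokens, kws))


# Hypothetical auxiliary corpus (leaked or synthetic documents).
aux_docs = [
    "meeting about gas prices",
    "gas contract meeting today",
    "lunch meeting",
    "meeting notes",
    "gas prices",
]
aux = keyword_volumes(aux_docs, ["meeting", "gas", "lunch"])

# Hypothetical volumes observed for three opaque query tokens.
observed = {"t1": 0.58, "t2": 0.25, "t3": 0.77}
guess = match_tokens_by_volume(observed, aux)
```

The closer the auxiliary volumes track the real dataset's volumes, the more likely the rank order agrees with the true one, which is why the statistical fidelity of synthetic documents matters for this kind of attack.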

Problem

Research questions and friction points this paper is trying to address.

Enhancing leakage attacks on SSE with minimal real data

Using LLMs to generate synthetic data for attacks

Improving attack accuracy with clustering-based generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs for synthetic data generation

Uses clustering to enhance attack accuracy

Focuses on minimal leaked data scenarios
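
The summary does not spell out how clustering feeds into generation, so the sketch below only illustrates one plausible reading: group the few leaked documents by hierarchical (agglomerative) clustering and hand each group to the LLM as a set of examples. The Jaccard distance on word sets, the average linkage, and the function names are assumptions of this sketch, and the LLM call itself is deliberately omitted.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two word sets (0 = identical, 1 = disjoint)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def agglomerative_clusters(docs, num_clusters):
    """Average-linkage agglomerative clustering over document word sets.

    Returns a list of clusters, each a list of indices into docs.
    Quadratic-time and meant only for small illustrative inputs.
    """
    bags = [set(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        total = sum(jaccard_distance(bags[i], bags[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters and merge them.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def example_groups(docs, num_clusters):
    """Group documents by cluster, e.g. to use each group as prompt examples."""
    return [[docs[i] for i in c] for c in agglomerative_clusters(docs, num_clusters)]


# Hypothetical leaked documents.
leaked = [
    "gas prices rose today",
    "gas prices fell today",
    "lunch meeting at noon",
    "lunch meeting at one",
]
groups = example_groups(leaked, 2)
```

In contrast, the random-sampling variant would draw example documents uniformly from the leaked set without grouping them first.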
