Comparing Reconstruction Attacks on Pretrained Versus Full Fine-tuned Large Language Model Embeddings on Homo Sapiens Splice Sites Genomic Data

📅 2025-11-09
🤖 AI Summary
This study investigates how fine-tuning affects the privacy vulnerability of large language models (LLMs) to embedding reconstruction attacks on *Homo sapiens* splice-site genomic data. To account for DNA sequence characteristics, we propose a specialized nucleotide tokenization scheme and systematically evaluate reconstruction attack success rates across XLNet, GPT-2, and BERT—comparing pre-trained versus fully fine-tuned variants on the HS3D dataset. Results demonstrate that task-specific fine-tuning significantly enhances model robustness against such attacks: reconstruction success rates decrease by 19.8%, 9.8%, and 7.8% for XLNet, GPT-2, and BERT, respectively. This work provides the first empirical evidence in genomic LLMs that fine-tuning confers a privacy-enhancing effect—challenging the conventional assumption that adaptation invariably increases memorization risk. Our findings establish a novel paradigm and actionable framework for privacy-aware design of biomedical foundation models.

📝 Abstract
This study investigates embedding reconstruction attacks on large language models (LLMs) applied to genomic sequences, with a specific focus on how fine-tuning affects vulnerability to these attacks. Building upon Pan et al.'s seminal work demonstrating that embeddings from pretrained language models can leak sensitive information, we conduct a comprehensive analysis using the HS3D genomic dataset to determine whether task-specific optimization strengthens or weakens privacy protections. Our research extends Pan et al.'s work in three significant dimensions. First, we apply their reconstruction attack pipeline to both pretrained and fine-tuned model embeddings, addressing a critical gap in their methodology, which did not specify embedding types. Second, we implement specialized tokenization mechanisms tailored to DNA sequences, enhancing the models' ability to process genomic data, since these models are pretrained on natural language rather than DNA. Third, we perform a detailed comparative analysis examining position-specific, nucleotide-type, and privacy changes between pretrained and fine-tuned embeddings. We assess embedding vulnerabilities across different types and dimensions, providing deeper insight into how task adaptation shifts privacy risks throughout genomic sequences. Our findings show a clear distinction in reconstruction vulnerability between pretrained and fine-tuned embeddings. Notably, fine-tuning strengthens resistance to reconstruction attacks across multiple architectures -- XLNet (+19.8%), GPT-2 (+9.8%), and BERT (+7.8%) -- pointing to task-specific optimization as a potential privacy enhancement mechanism. These results underscore the need for advanced protective mechanisms in language models processing sensitive genomic data, while positioning fine-tuning as a privacy-enhancing technique worth further exploration.
Problem

Research questions and friction points this paper is trying to address.

Investigates embedding reconstruction attacks on genomic sequences in large language models
Compares privacy vulnerabilities between pretrained and fine-tuned model embeddings
Assesses how task-specific optimization affects reconstruction attack resistance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applied reconstruction attacks to pretrained and fine-tuned embeddings
Implemented specialized tokenization mechanisms for DNA sequences
Performed comparative analysis of privacy changes between embeddings
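The paper does not detail its DNA tokenization scheme beyond calling it specialized for nucleotide sequences. As a rough illustration of the general idea, the sketch below builds an overlapping k-mer tokenizer, a common way to adapt natural-language LLM vocabularies to genomic data; all function names, the choice of k=3, and the special tokens are assumptions for this example, not the paper's actual implementation.

```python
from itertools import product

def build_kmer_vocab(k=3, alphabet="ACGT",
                     specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]")):
    """Enumerate all k-mers over the nucleotide alphabet, plus special tokens.

    For k=3 over ACGT this yields 4 special tokens + 64 k-mers = 68 entries.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {tok: i for i, tok in enumerate(list(specials) + kmers)}

def tokenize(sequence, vocab, k=3):
    """Split a DNA sequence into overlapping k-mers and map them to token IDs.

    Unknown k-mers (e.g. containing ambiguity codes like 'N') map to [UNK].
    """
    unk = vocab["[UNK]"]
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return [vocab.get(kmer, unk) for kmer in kmers]

vocab = build_kmer_vocab(k=3)
ids = tokenize("ACGTAC", vocab)  # a 6-nt sequence yields 4 overlapping 3-mers
```

In this setup, a splice-site window of length L becomes L - k + 1 tokens, so positional effects in the embeddings (as analyzed in the paper's position-specific comparison) remain aligned with nucleotide positions up to the k-mer offset.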