AttenMIA: LLM Membership Inference Attack through Attention Signals

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the privacy and intellectual property risks posed by large language models (LLMs) memorizing training data, a vulnerability inadequately exploited by existing membership inference attacks. The authors propose a novel membership inference method that leverages the Transformer’s self-attention mechanism: by analyzing information flow patterns across multiple attention heads and integrating perturbation-induced divergence metrics, they construct a highly discriminative classifier. This approach is the first to utilize attention signals for membership inference, revealing that while attention mechanisms enhance model interpretability, they simultaneously exacerbate privacy leakage. Evaluated on benchmarks such as WikiMIA-32, the method significantly outperforms prior techniques—achieving a 0.996 ROC AUC and 87.9% TPR@1%FPR on Llama2-13b—and demonstrates strong generalization across datasets and model architectures, thereby substantially strengthening training data extraction attacks.
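The headline numbers quoted above use TPR@1%FPR, the standard low-false-positive metric for membership inference: the fraction of true members an attack catches when its threshold is set so that at most 1% of non-members are falsely flagged. A minimal numpy sketch of how such a metric is computed from attack scores (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """TPR at a fixed FPR budget: pick the threshold that lets at most
    target_fpr of non-members through, then measure the fraction of
    members scoring above it."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    neg = np.sort(scores[~labels])[::-1]          # non-member scores, high to low
    k = int(np.floor(target_fpr * len(neg)))      # allowed false positives
    thresh = neg[k] if k < len(neg) else -np.inf  # strict cutoff
    return float(np.mean(scores[labels] > thresh))
```

With 100 non-members scored 0..99 and a 1% budget, the cutoff lands just above the second-highest non-member score, so only members scoring above it count as detected.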

📝 Abstract
Large Language Models (LLMs) are increasingly deployed to enable or improve a multitude of real-world applications. Given the large size of their training data sets, their tendency to memorize training data raises serious privacy and intellectual property concerns. A key threat is the membership inference attack (MIA), which aims to determine whether a given sample was included in the model's training set. Existing MIAs for LLMs rely primarily on output confidence scores or embedding-based features, but these signals are often brittle, leading to limited attack success. We introduce AttenMIA, a new MIA framework that exploits self-attention patterns inside the transformer model to infer membership. Attention controls the information flow within the transformer, exposing different patterns for memorization that can be used to identify members of the dataset. Our method uses information from attention heads across layers and combines it with perturbation-based divergence metrics to train an effective MIA classifier. Using extensive experiments on open-source models including LLaMA-2, Pythia, and OPT models, we show that attention-based features consistently outperform baselines, particularly under the important low-false-positive metric (e.g., achieving up to 0.996 ROC AUC and 87.9% TPR@1%FPR on the WikiMIA-32 benchmark with Llama2-13b). We show that attention signals generalize across datasets and architectures, and provide a layer- and head-level analysis of where membership leakage is most pronounced. We also show that using AttenMIA to replace other membership inference attacks in a data extraction framework results in training data extraction attacks that outperform the state of the art. Our findings reveal that attention mechanisms, originally introduced to enhance interpretability, can inadvertently amplify privacy risks in LLMs, underscoring the need for new defenses.
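The abstract describes combining per-head attention statistics with perturbation-induced divergence into features for a membership classifier. A hedged numpy sketch of that idea, assuming attention entropy per head as the attention feature and KL divergence between original and perturbed output distributions as the perturbation signal; all function names, the feature choice, and the hand-set weights are illustrative, not the authors' actual design:

```python
import numpy as np

def attention_entropy_features(attn, eps=1e-12):
    """Per-(layer, head) mean attention entropy.
    attn: array of shape (layers, heads, seq, seq), rows summing to 1.
    Sharply peaked attention (low entropy) is treated here as a
    possible memorization signal -- an illustrative assumption."""
    ent = -np.sum(attn * np.log(attn + eps), axis=-1)  # entropy per query position
    return ent.mean(axis=-1).ravel()                   # one feature per head

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between next-token distributions for the original
    sample and a perturbed version of it."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def membership_score(attn, probs_orig, probs_pert, w_attn=1.0, w_kl=1.0):
    """Toy scalar score: high perturbation divergence plus sharp
    attention suggests membership. The actual attack trains a
    classifier on such features rather than hand-weighting them."""
    return (w_kl * kl_divergence(probs_orig, probs_pert)
            - w_attn * attention_entropy_features(attn).mean())
```

In this toy scoring rule, a sample whose attention rows are nearly one-hot (entropy near zero) scores higher than one with diffuse, uniform attention, matching the intuition that memorized text induces distinctive attention patterns.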
Problem

Research questions and friction points this paper is trying to address.

Membership Inference Attack
Large Language Models
Privacy Risk
Training Data Memorization
Attention Mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Membership Inference Attack
Attention Mechanism
Large Language Models
Privacy Leakage
Transformer