Fragments to Facts: Partial-Information Fragment Inference from LLMs

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work uncovers a novel privacy vulnerability in fine-tuned large language models (LLMs) under weak adversarial assumptions: an attacker possessing only fragmented, unordered sensitive tokens (e.g., isolated disease names) can infer other semantically associated sensitive attributes (e.g., inferring “osteoarthritis” from “hypertension”). To formalize this threat, the authors introduce the first general threat model for unordered fragment extraction and propose PRISM, a data-agnostic inference method that regularizes likelihood ratios using external priors, eliminating dependence on the target data distribution or labels. PRISM’s zero-shot and few-shot attack efficacy is validated across medical and legal domains. Experiments show that PRISM performs on par with supervised baselines, even without access to labels, establishing for the first time that fine-tuned LLMs are highly susceptible to inference from unordered sensitive fragments. This work establishes a new paradigm for assessing privacy risks in fine-tuned LLMs.

📝 Abstract
Large language models (LLMs) can leak sensitive training data through memorization and membership inference attacks. Prior work has primarily focused on strong adversarial assumptions, including attacker access to entire samples or long, ordered prefixes, leaving open the question of how vulnerable LLMs are when adversaries have only partial, unordered sample information. For example, if an attacker knows a patient has "hypertension," under what conditions can they query a model fine-tuned on patient data to learn the patient also has "osteoarthritis?" In this paper, we introduce a more general threat model under this weaker assumption and show that fine-tuned LLMs are susceptible to these fragment-specific extraction attacks. To systematically investigate these attacks, we propose two data-blind methods: (1) a likelihood ratio attack inspired by methods from membership inference, and (2) a novel approach, PRISM, which regularizes the ratio by leveraging an external prior. Using examples from both medical and legal settings, we show that both methods are competitive with a data-aware baseline classifier that assumes access to labeled in-distribution data, underscoring their robustness.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM vulnerability to attacks with partial, unordered data
Developing fragment-specific extraction attacks under weak adversarial assumptions
Evaluating data-blind methods for sensitive information leakage in fine-tuned LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Likelihood ratio attack for fragment inference
PRISM method with external prior regularization
Data-blind techniques for extraction attacks
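The two scores above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy log-probabilities, the function names, and the `lam` mixing weight in the PRISM-style variant are all hypothetical; in practice the log-likelihoods would come from scoring candidate fragments with the fine-tuned model and an external reference model (e.g., its base checkpoint).

```python
def likelihood_ratio_score(ft_logprob: float, ref_logprob: float) -> float:
    """Membership-inference-style score: how much more likely the
    fine-tuned model finds a candidate fragment than a reference model."""
    return ft_logprob - ref_logprob


def prism_style_score(ft_logprob: float, prior_logprob: float,
                      lam: float = 0.5) -> float:
    """Illustrative prior-regularized variant (hypothetical form, not the
    paper's exact definition): damp the reference term with an external
    prior so no in-distribution calibration data is needed."""
    return ft_logprob - lam * prior_logprob


# Toy log-probabilities for candidate fragments, scored alongside a known
# fragment such as "hypertension". Higher score = more likely memorized
# co-occurrence in the fine-tuning data.
candidates = {
    "osteoarthritis": {"ft": -2.0, "prior": -6.0},
    "sprained ankle": {"ft": -5.5, "prior": -5.0},
}

scores = {
    frag: likelihood_ratio_score(lp["ft"], lp["prior"])
    for frag, lp in candidates.items()
}
best = max(scores, key=scores.get)
print(best)
```

The ranking, not the absolute score, is what matters: the attack outputs the candidate fragments whose likelihood under the fine-tuned model is most inflated relative to the prior.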