Verifying LLM Inference to Prevent Model Weight Exfiltration

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face security risks during inference, where attackers may exfiltrate model weights via steganographic leakage in generated responses. Method: We propose the first verifiable-response defense framework, modeling steganographic leakage as a security game with formally defined trust assumptions and sources of nondeterminism. Our approach features a provably secure output verification mechanism grounded in formal safety analysis, complemented by two lightweight estimators for efficient anomaly detection. Contribution/Results: Evaluated on models ranging from 3B to 30B parameters, our framework reduces steganographic information leakage to below 0.5% on MOE-Qwen-30B, achieves a false positive rate of only 0.01%, and degrades attacker efficacy by over 200×. This work establishes the first practical, verifiable, and formally proven defense against steganographic weight extraction from LLM inference outputs.
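The summary names a provably secure output verification mechanism without detailing it. A minimal sketch of the general idea, assuming greedy decoding and trusted re-execution: `logits_fn`, `gap_tol`, and the flag-counting policy below are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def verify_greedy_response(prompt_ids, response_ids, logits_fn, gap_tol):
    """Re-run decoding on trusted hardware and accept a divergent token
    only when the logit gap between the served token and the local argmax
    fits within the tolerance attributed to valid non-determinism
    (e.g., floating-point reduction order differing across GPUs).

    logits_fn -- trusted model: maps a token prefix to next-token logits
    gap_tol   -- per-token tolerance estimated from honest hardware noise
    """
    context = list(prompt_ids)
    flagged = 0
    for served in response_ids:
        logits = logits_fn(context)
        local_best = int(np.argmax(logits))
        if served != local_best:
            # A small gap is explainable by nondeterministic kernels;
            # a large gap suggests the token may carry smuggled bits.
            if logits[local_best] - logits[served] > gap_tol:
                flagged += 1
        context.append(served)
    return flagged  # compare to a threshold tuned for the target FPR
```

The tighter `gap_tol` is, the fewer bits per token an attacker can hide without tripping the check, which is the intuition behind the reported leakage reduction.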

📝 Abstract
As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model outputs, a strategy known as steganography. This work investigates how to verify model responses to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. We formalize model exfiltration as a security game, propose a verification framework that can provably mitigate steganographic exfiltration, and specify the trust assumptions associated with our scheme. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them. We evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% with a false-positive rate of 0.01%, corresponding to a >200× slowdown for adversaries. Overall, this work further establishes a foundation for defending against model weight exfiltration and demonstrates that strong protection can be achieved with minimal additional cost to inference providers.
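The abstract introduces two practical estimators for valid non-determinism without specifying them. A hedged sketch of what such estimators could look like, assuming noise is measured by replaying the same prefixes on two honest replicas (`logits_fn_a`, `logits_fn_b`, and both estimator names are hypothetical):

```python
import numpy as np

def empirical_logit_noise(prefixes, logits_fn_a, logits_fn_b):
    """Per-prefix maximum logit deviation between two honest replicas
    (e.g., different GPUs, batch sizes, or kernel schedules)."""
    return np.asarray([
        np.abs(logits_fn_a(p) - logits_fn_b(p)).max() for p in prefixes
    ])

def tolerance_worst_case(deltas, margin=1.5):
    # Estimator 1: worst observed deviation inflated by a safety margin;
    # conservative, so near-zero false positives but a looser leakage bound.
    return margin * deltas.max()

def tolerance_quantile(deltas, q=0.9999):
    # Estimator 2: extreme quantile of the deviation distribution;
    # tighter, trading an explicit false-positive budget for less leakage.
    return float(np.quantile(deltas, q))
```

A 0.01% false-positive rate, as reported, corresponds to calibrating such a threshold at roughly the 99.99th percentile of honest deviations.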
Problem

Research questions and friction points this paper is trying to address.

Preventing model weight exfiltration through steganography in LLM inference
Detecting anomalous or buggy behavior during AI model inference
Verifying model responses to mitigate steganographic exfiltration attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes model exfiltration as a security game (see the toy sketch after this list)
Introduces two practical estimators for valid non-determinism in LLM inference
Proposes a verification framework that provably mitigates steganographic exfiltration
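The security-game framing can be illustrated with a toy Monte-Carlo model (not the paper's formal definition): each tampered reply smuggles a fixed number of weight bits, and the verifier independently flags it with some probability, halting the game on the first flag.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_exfiltration_game(n_queries, bits_per_reply, detect_prob):
    """Bits leaked before the verifier first flags a tampered reply."""
    leaked = 0
    for _ in range(n_queries):
        if rng.random() < detect_prob:
            break  # attacker caught; the game ends
        leaked += bits_per_reply
    return leaked

# Expected leakage falls geometrically in the per-reply detection rate,
# which is how per-token checks translate into a large slowdown factor
# (the paper reports >200x) for an adversary exfiltrating full weights.
games = [play_exfiltration_game(10_000, 8, 0.02) for _ in range(1_000)]
print(np.mean(games))
```

Under this toy model the attacker's expected haul is about bits_per_reply / detect_prob, so halving the leakage per reply or doubling the detection probability each doubles the exfiltration time.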
👥 Authors
Roy Rinberg (Harvard University)
Adam Karvonen (ML Researcher)
Alex Hoover (Stevens Institute of Technology)
Daniel Reuter (ML Alignment & Theory Scholars, MATS)
Keri Warr (Anthropic)