Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the risk that hosted large language model (LLM) providers silently substitute a cheaper model for the one they advertise. Existing probe-after-return verification schemes such as SVIP leave a parallel-serve side channel: a dishonest provider can route the verifier's probes to the advertised model while serving ordinary users from a substitute. To close this gap, the authors propose a commit-open protocol. Before any opening request, the provider commits via a Merkle tree to a per-position sparse autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. The verifier then opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. Evaluated on Qwen3-1.7B, Gemma-2-2B, and Gemma-2-9B, the protocol rejects all 17 tested attackers (same-family lifts, cross-family substitutes, and adaptive LoRA of rank at most 128) at a single shared threshold, whereas the same attackers all evade a matched SVIP-style parallel-serve baseline. Commitment adds at most 2.1% to forward-only wall-clock time at batch size 32.
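The commit-then-open step can be illustrated with a minimal Merkle-tree sketch. This is an assumption-laden illustration, not the authors' implementation: the function names (`leaf_hash`, `build_tree`, `open_position`, `verify_opening`), the SHA-256 hash, the per-leaf salt, the byte encoding of each sketch, and the duplicate-last-node rule for odd levels are all hypothetical choices made here for clarity.

```python
import hashlib
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(position: int, sketch: bytes, salt: bytes) -> bytes:
    # Prefix 0x00 separates leaves from internal nodes (prefix 0x01).
    return _h(b"\x00" + position.to_bytes(8, "big") + salt + sketch)


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    # levels[0] holds the leaves; levels[-1] holds the single root.
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            cur = cur + [cur[-1]]  # assumption: duplicate the last node
        levels.append(
            [_h(b"\x01" + cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)]
        )
    return levels


def open_position(levels: list[list[bytes]], index: int) -> list[bytes]:
    # Collect the sibling hash at every level below the root.
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # the node was paired with its own copy
        path.append(level[sibling])
        index //= 2
    return path


def verify_opening(root: bytes, index: int, leaf: bytes, path: list[bytes]) -> bool:
    # Recompute the root from the opened leaf and its sibling path.
    node = leaf
    for sibling in path:
        if index % 2 == 0:
            node = _h(b"\x01" + node + sibling)
        else:
            node = _h(b"\x01" + sibling + node)
        index //= 2
    return node == root


# Provider side: commit to one (hypothetical) serialized sketch per position.
sketches = [f"trace-{i}".encode() for i in range(5)]
salts = [secrets.token_bytes(16) for _ in sketches]
leaves = [leaf_hash(i, s, salt) for i, (s, salt) in enumerate(zip(sketches, salts))]
levels = build_tree(leaves)
root = levels[-1][0]  # published before any opening request

# Verifier side: pick a random position and check the opening.
i = secrets.randbelow(len(sketches))
path = open_position(levels, i)
opened_ok = verify_opening(root, i, leaf_hash(i, sketches[i], salts[i]), path)
```

Because the root is fixed before the verifier chooses which positions to open, the provider cannot tailor the revealed traces to the challenge after the fact, which is the property the summary relies on.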

📝 Abstract

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones: Qwen3-1.7B, Gemma-2-2B, and a 4.5× scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17 attackers spanning same-family lifts, cross-family substitutes, and rank-≤128 adaptive LoRA, all are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds ≤2.1% to forward-only wall-clock at batch 32.
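The abstract does not spell out the joint-consistency z-score rule, so the following is only one plausible shape for a fixed-threshold decision, not the paper's statistic. It assumes each opened probe reading has a calibrated mean and standard deviation (standing in for the cross-backend noise calibration), treats the per-probe z-scores as independent, and standardizes their sum of squares. The names `joint_z` and `accept` and the threshold value 3.0 are hypothetical.

```python
import math


def joint_z(observed: list[float], mu: list[float], sigma: list[float]) -> float:
    # Per-probe z-scores against the calibrated honest-model statistics.
    k = len(observed)
    chi2 = sum(((o - m) / s) ** 2 for o, m, s in zip(observed, mu, sigma))
    # Assumption: under independent unit-variance z-scores, the sum of
    # squares has mean k and variance 2k, so this is roughly standard normal.
    return (chi2 - k) / math.sqrt(2 * k)


def accept(observed: list[float], mu: list[float], sigma: list[float],
           threshold: float = 3.0) -> bool:
    # Fixed threshold: accept only if the joint deviation is not too large.
    return joint_z(observed, mu, sigma) <= threshold


# Hypothetical calibration for 8 opened probe readings.
mu = [0.5] * 8
sigma = [0.1] * 8

honest_like = [0.5] * 8      # matches calibration exactly
substitute_like = [0.8] * 8  # every reading is 3 sigma away

honest_ok = accept(honest_like, mu, sigma)
substitute_ok = accept(substitute_like, mu, sigma)
```

A single joint statistic lets one threshold be shared across backbones, provided the calibration absorbs backend-specific noise; whether the paper combines scores this way is not stated in the abstract.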

Problem

Research questions and friction points this paper is trying to address.

model substitution

hosted LLMs

audit

side-channel

trust verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

commit-open protocol

sparse autoencoder (SAE)

model substitution detection