🤖 AI Summary
This work identifies a security vulnerability in large language models (LLMs) arising from neuronal polysemanticity, where individual neurons encode multiple semantically unrelated features. Method: We propose a four-layer intervention framework (prompt, token, feature, and neuron levels) and integrate sparse autoencoders to achieve feature disentanglement and precise neuron-level localization. Contribution/Results: We demonstrate empirically, for the first time, that the polysemantic structures of Pythia-70M and GPT-2-Small transfer stably across model scales and architectures. Furthermore, we successfully transfer our intervention strategy to black-box, instruction-tuned models, LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct, enabling robust, cross-layer implicit attacks. This study provides the first systematic empirical validation of the generalizability and transferability of polysemantic topologies, establishing a novel paradigm for understanding the security boundaries of internal representations in LLMs.
📝 Abstract
Polysemanticity -- where individual neurons encode multiple unrelated features -- is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. Yet its implications for model safety are still poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared by both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings not only suggest the generalizability of the interventions but also point to a stable, transferable polysemantic structure that may persist across architectures and training regimes.
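To make the sparse-autoencoder step concrete, the following is a minimal sketch of how an SAE disentangles polysemantic activations into a sparser, overcomplete feature dictionary. All dimensions, names, and the 4x expansion factor are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

# Hypothetical sizes: a residual-stream width of 64 with a 4x
# overcomplete feature dictionary (both assumed for illustration).
rng = np.random.default_rng(0)
d_model = 64
d_feat = 256

W_enc = rng.normal(0, 0.1, (d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.1, (d_feat, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec               # linear decoder reconstructs x
    return f, x_hat

x = rng.normal(size=(8, d_model))           # a batch of fake activations
f, x_hat = sae_forward(x)

# Training (not shown) would minimize reconstruction error plus an L1
# penalty on f, i.e. loss = ||x - x_hat||^2 + lam * ||f||_1, which pushes
# most feature activations to zero so each surviving feature tends to
# fire for a single concept rather than a polysemantic mixture.
print(f.shape, x_hat.shape)                 # (8, 256) (8, 64)
```

Once trained, individual columns of `W_dec` act as candidate monosemantic directions, which is what enables the feature- and neuron-level localization the framework relies on.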