Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

📅 2026-04-22

🤖 AI Summary
This study addresses the lack of systematic internal analysis in current safety audits of large language models, which often fail to uncover vulnerabilities rooted in model internals. To bridge this gap, the work proposes an interpretability-driven activation intervention method that integrates Universal Steering (US) with Representation Engineering (RepE) for the first time. An adaptive two-stage grid search tunes the intervention parameters, enabling jailbreaking audits across eight prominent open-source models. The Llama-3 series proves highly susceptible (jailbreak success rate of up to 91%), while GPT-oss-120B remains robust. Qwen and Phi models differ markedly across scales: the smaller Qwen3-0.6B and Phi4-3.8B mostly show lower jailbreak rates than their larger counterparts. These findings support the method's efficacy in linking internal model representations to unsafe behaviors and underscore the dual-use nature of interpretability techniques in security auditing.
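As a rough illustration of what an activation intervention looks like, the sketch below adds a scaled concept direction to a hidden-state vector. This is the generic additive form of activation steering and is an assumption here; the summary does not spell out the exact update used by US or RepE, and the name `steer` and its parameters are hypothetical.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], coeff: float) -> List[float]:
    """Return hidden + coeff * direction (generic additive steering).

    `hidden` stands in for one token's activation at one layer and
    `direction` for a concept vector; both are illustrative placeholders,
    not the paper's actual tensors.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + coeff * d for h, d in zip(hidden, direction)]


# A coefficient of 0 leaves the activation unchanged; larger magnitudes
# push it further along the concept direction.
steered = steer([1.0, 2.0], [0.5, -1.0], 2.0)
```

The coefficient is the quantity a parameter search would tune: too small and the model's behavior does not change, too large and outputs tend to degrade.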

📝 Abstract
Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging two interpretability-based approaches, Universal Steering (US) and Representation Engineering (RepE), we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries with a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
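The abstract does not give the details of the two-stage grid search, so the following is only a minimal coarse-to-fine sketch of the general idea: scan a wide range of steering coefficients, then search more densely around the best coarse value. The function names, grid sizes, and the toy `score` function are assumptions for illustration; in a real audit the score would come from running the steered model and judging its responses.

```python
from typing import Callable, List, Tuple


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """Return n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def two_stage_grid_search(
    score: Callable[[float], float],
    lo: float,
    hi: float,
    coarse_points: int = 5,
    fine_points: int = 5,
) -> Tuple[float, float]:
    """Coarse-to-fine search for the coefficient that maximizes `score`."""
    if coarse_points < 2 or fine_points < 2:
        raise ValueError("each stage needs at least two grid points")

    # Stage 1: coarse scan over the full range.
    coarse = [(score(c), c) for c in linspace(lo, hi, coarse_points)]
    _, best_coarse = max(coarse)

    # Stage 2: finer scan within one coarse step of the best coarse value,
    # clipped to the original range.
    step = (hi - lo) / (coarse_points - 1)
    fine_lo = max(lo, best_coarse - step)
    fine_hi = min(hi, best_coarse + step)
    fine = [(score(c), c) for c in linspace(fine_lo, fine_hi, fine_points)]
    best_score, best_coeff = max(fine)
    return best_coeff, best_score


def toy_score(coeff: float) -> float:
    """Hypothetical stand-in for a jailbreak-rate measurement."""
    return -((coeff - 3.2) ** 2)


best_coeff, best_score = two_stage_grid_search(toy_score, 0.0, 8.0)
```

With the toy score, the coarse stage selects 4.0 from the grid 0, 2, 4, 6, 8, and the fine stage then selects 3.0 from the grid 2, 3, 4, 5, 6. Because each score evaluation would mean generating and judging model responses, the two-stage layout keeps the number of evaluations small compared with one dense grid.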
Problem

Research questions and friction points this paper is trying to address.

safety auditing
large language models
jailbreaking
interpretability
model vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretability-based auditing
Universal Steering
Representation Engineering
jailbreaking vulnerability
activation steering
Krishiv Agarwal
NuSCI Research Group, Computer Science Laboratory, SRI
Ramneet Kaur
Advanced Computer Scientist, SRI
Trustworthy AI, Interpretability, Reliability, Conformal Prediction, GenAI
Colin Samplawski
NuSCI Research Group, Computer Science Laboratory, SRI
Manoj Acharya
SRI International
Artificial Intelligence, Computer Vision, NLP, Visual Question Answering
Anirban Roy
Principal Scientist, SRI International
Computer Vision, Machine Learning, Deep Learning, Neural Networks
Daniel Elenius
NuSCI Research Group, Computer Science Laboratory, SRI
Brian Matejek
NuSCI Research Group, Computer Science Laboratory, SRI
Adam D. Cobb
NuSCI Research Group, Computer Science Laboratory, SRI
Susmit Jha
Director, Neurosymbolic Computing and Intelligence, SRI International
Artificial Intelligence, Autonomy, Formal Methods, Machine Learning