Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic internal analysis in current safety audits of large language models, which often miss vulnerabilities rooted in model internals. To bridge this gap, the work builds an interpretability-driven activation intervention method on two techniques, Universal Steering (US) and Representation Engineering (RepE). An adaptive two-stage grid search optimizes the activation-steering coefficients, enabling jailbreaking audits across eight prominent open-source models. The Llama-3 models are highly susceptible, with jailbreak success rates of up to 91% (US) and 83% (RepE) on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to both approaches. Qwen and Phi models show mixed results: the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibit lower jailbreak rates than their larger counterparts. These findings support the method's efficacy in linking internal model representations to unsafe behaviors and underscore the dual-use nature of interpretability techniques in security auditing.
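The activation intervention described above can be pictured as adding a scaled concept direction to a hidden state. The following is only a minimal sketch under stated assumptions: the difference-of-means direction and the helper names `mean_difference_direction` and `steer` are illustrative choices, not the procedure the paper confirms for US or RepE.

```python
from typing import List


def mean_difference_direction(
    unsafe_acts: List[List[float]], safe_acts: List[List[float]]
) -> List[float]:
    """Assumed direction estimate: mean(unsafe activations) - mean(safe activations)."""
    dim = len(unsafe_acts[0])
    mean_unsafe = [sum(a[i] for a in unsafe_acts) / len(unsafe_acts) for i in range(dim)]
    mean_safe = [sum(a[i] for a in safe_acts) / len(safe_acts) for i in range(dim)]
    return [u - s for u, s in zip(mean_unsafe, mean_safe)]


def steer(hidden: List[float], direction: List[float], coeff: float) -> List[float]:
    """Return hidden + coeff * (direction / ||direction||)."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + coeff * d / norm for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy 2-dimensional activations, purely illustrative.
    direction = mean_difference_direction([[1.0, 1.0], [3.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]])
    print(steer([0.5, 0.5], direction, coeff=2.0))
```

In a real audit the hidden state would come from a chosen transformer layer, and `coeff` is the steering coefficient that the grid search tunes.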

📝 Abstract

Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging two interpretability-based approaches, Universal Steering (US) and Representation Engineering (RepE), we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
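The abstract does not spell out how the two-stage grid search adapts, so the following is a hedged sketch of one plausible reading: a coarse sweep over the coefficient range, followed by a finer sweep around the best coarse point. The function names, grid sizes, and the `score` callback (a stand-in for a judged jailbreak rate at a given coefficient) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """Return n evenly spaced points from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def two_stage_grid_search(
    score: Callable[[float], float],
    lo: float,
    hi: float,
    coarse_points: int = 5,
    fine_points: int = 5,
) -> Tuple[float, float]:
    """Coarse sweep over [lo, hi], then a finer sweep around the best coarse point."""
    if coarse_points < 2 or fine_points < 2:
        raise ValueError("each stage needs at least two grid points")

    # Stage 1: coarse sweep over the full coefficient range.
    coarse = linspace(lo, hi, coarse_points)
    best_coarse = max(coarse, key=score)

    # Stage 2: refine within one coarse step on either side, clipped to [lo, hi].
    spacing = (hi - lo) / (coarse_points - 1)
    fine = linspace(max(lo, best_coarse - spacing), min(hi, best_coarse + spacing), fine_points)

    best = max(fine + [best_coarse], key=score)
    return best, score(best)


if __name__ == "__main__":
    # Hypothetical smooth score peaking at coefficient 3.2; a real audit would
    # generate steered responses and have a judge model rate them.
    toy_score = lambda c: -(c - 3.2) ** 2
    print(two_stage_grid_search(toy_score, lo=0.0, hi=8.0))
```

Under this reading, the second stage spends its evaluations only near the promising region, which matters when each score requires generating and judging many model responses.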

Problem

Research questions and friction points this paper is trying to address.

safety auditing

large language models

jailbreaking

interpretability

model vulnerabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretability-based auditing

Universal Steering

Representation Engineering