🤖 AI Summary
Unsafe behaviors of large language models (LLMs) in real-world deployment pose critical risks, yet existing interpretability research lacks a unified, security-oriented analytical framework that spans the full model workflow, from input through reasoning to output.
Method: We systematically review nearly 70 works and propose the first security-aware interpretability taxonomy grounded in LLM workflow stages. By integrating literature analysis, cross-method comparison, joint security-interpretability modeling, and toolchain mapping, we construct a structured knowledge graph that uncovers the causal pathways through which explanation techniques enhance safety.
Contributions/Results: (1) We establish a theoretical paradigm for co-optimizing interpretability and safety; (2) we identify key technical gaps that hinder practical adoption; and (3) we deliver actionable guidelines for both researchers and practitioners, bridging theory and implementation in secure LLM development.
📝 Abstract
As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal the causes of unsafe outputs and guide safety improvements, yet prior surveys often overlook this connection between interpretability and safety. We present the first survey to bridge this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at the intersections of these dimensions. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advances toward safer, more interpretable LLMs.