🤖 AI Summary
Unsafe behaviors of large language models (LLMs) in real-world deployment pose critical risks, yet existing interpretability research lacks a unified, security-oriented analytical framework that spans the full model workflow, from input through reasoning to output.
Method: We systematically review nearly 70 works and propose the first security-aware interpretability taxonomy grounded in LLM workflow stages. By integrating literature analysis, cross-method comparison, joint security-interpretability modeling, and toolchain mapping, we construct a structured knowledge graph that uncovers the causal pathways through which explanation techniques enhance safety.
Contributions/Results: (1) We establish a theoretical paradigm for co-optimizing interpretability and safety; (2) we identify key technical gaps that hinder practical adoption; and (3) we deliver actionable guidelines for both researchers and practitioners, bridging theory and implementation in secure LLM development.
📝 Abstract
As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal the causes of unsafe outputs and guide safety improvements, yet prior surveys often overlook this connection between interpretability and safety. We present the first survey to bridge this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at the intersections of these dimensions. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advances toward safer, more interpretable LLMs.