LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face security risks from jailbreaking attacks that elicit harmful content; existing alignment and safety fine-tuning methods offer insufficient robustness. Method: We propose LLMSymGuard, a symbolic safety guardrail framework based on sparse autoencoders (SAEs). It is the first approach to integrate mechanistic interpretability with symbolic logic, extracting interpretable, jailbreak-relevant semantic concepts directly from internal model representations in an unsupervised manner and then constructing transparent, formally verifiable logical guardrails from them. Crucially, LLMSymGuard requires no model fine-tuning and delivers dynamic, real-time safety intervention. Results: It significantly improves defense efficacy across diverse jailbreaking attacks while preserving the model’s original linguistic capabilities. Its core contribution is an end-to-end “interpretable concept → symbolic rule” safety paradigm that jointly ensures security, transparency, and controllability.
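The sketch below is a minimal, illustrative take (not the authors' released code) on the SAE side of such a pipeline: a sparse autoencoder trained on hidden activations, with jailbreak-associated latent features selected by contrasting activations on jailbreak versus benign prompts. All names (`SparseAutoencoder`, `jailbreak_features`, `hidden_dim`, etc.) and the feature-selection heuristic are assumptions for illustration.

```python
# Minimal sketch (assumptions throughout): a sparse autoencoder (SAE) over
# hidden activations, used to surface candidate jailbreak-related features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, hidden_dim: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, n_latents)
        self.decoder = nn.Linear(n_latents, hidden_dim)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))   # sparse, non-negative latent code
        h_hat = self.decoder(z)           # reconstruction of the activation
        return z, h_hat

def sae_loss(h, h_hat, z, l1_coeff=1e-3):
    # Standard SAE objective: reconstruction error + L1 sparsity penalty.
    return nn.functional.mse_loss(h_hat, h) + l1_coeff * z.abs().mean()

def jailbreak_features(sae, jailbreak_acts, benign_acts, top_k=10):
    # Rank latent features by how much more they fire on jailbreak prompts
    # than on benign prompts; the top ones are candidate "jailbreak concepts".
    with torch.no_grad():
        z_jb, _ = sae(jailbreak_acts)
        z_ok, _ = sae(benign_acts)
    gap = z_jb.mean(dim=0) - z_ok.mean(dim=0)
    return torch.topk(gap, top_k).indices
```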

📝 Abstract
Large Language Models have found success in a variety of applications; however, their safety remains a matter of concern due to the existence of various types of jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide a certain degree of robustness against jailbreak attacks that covertly mislead LLMs towards the generation of harmful content. This leaves them prone to a number of vulnerabilities, ranging from targeted misuse to accidental profiling of users. This work introduces LLMSymGuard, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, LLMSymGuard enables building symbolic, logical safety guardrails, offering transparent and robust defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in mechanistic interpretability of LLMs, our approach demonstrates that LLMs learn human-interpretable concepts from jailbreaks, and provides a foundation for designing more interpretable and logical safeguard measures against attackers. Code will be released upon publication.
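To illustrate the "no fine-tuning, real-time intervention" aspect, the sketch below shows one common way to read a layer's activations at inference time with a forward hook, so that an SAE and guardrail rules could inspect a prompt before any response is generated. The model name, layer index, and variable names are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (assumptions throughout): capture a transformer layer's
# activations with a forward hook; no weights are modified or fine-tuned.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # illustrative model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

captured = {}

def hook(_module, _inputs, output):
    # Store this layer's hidden states for the current forward pass.
    captured["h"] = output[0].detach()

# Attach to one decoder block; the layer index is an arbitrary assumption.
handle = model.model.layers[16].register_forward_hook(hook)

prompt = "Example user prompt to screen"
with torch.no_grad():
    model(**tok(prompt, return_tensors="pt"))
handle.remove()

# captured["h"] now holds token-level activations, which can be pooled and
# passed to an SAE plus symbolic rules before generation proceeds.
```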
Problem

Research questions and friction points this paper is trying to address.

Detects jailbreak concepts in LLMs using interpretable representations
Prevents harmful content generation without model fine-tuning
Provides transparent safety guardrails against covert attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Sparse Autoencoders for interpretable concepts
Extracts semantic internal representations from LLMs
Builds symbolic, logical safety guardrails without fine-tuning (see the sketch below)
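As referenced in the last item above, here is a minimal sketch of how extracted concept features might be turned into a symbolic guardrail rule. The thresholds, rule structure, and all names (`ConceptRule`, `guardrail`, etc.) are illustrative assumptions rather than the paper's actual rules.

```python
# Minimal sketch (assumptions throughout): a symbolic rule over SAE concept
# activations; the caller can refuse or steer generation when a rule fires.
from dataclasses import dataclass
import torch

@dataclass
class ConceptRule:
    feature_ids: list        # SAE latent indices tied to one jailbreak theme
    threshold: float         # activation level above which a feature "fires"
    min_active: int = 1      # how many features must fire to trigger the rule

    def fires(self, z: torch.Tensor) -> bool:
        # z: SAE latent code for the current prompt (e.g. mean-pooled over tokens).
        active = (z[self.feature_ids] > self.threshold).sum().item()
        return active >= self.min_active

def guardrail(prompt_latents: torch.Tensor, rules: list) -> bool:
    # True if any symbolic rule flags the prompt; no base-model fine-tuning
    # is involved, only inspection of internal representations.
    return any(rule.fires(prompt_latents) for rule in rules)
```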
Darpan Aswal
Université Paris-Saclay
AI Safety
Céline Hudelot
MICS, CentraleSupélec, Université Paris-Saclay, 91190 Gif-sur-Yvette, France