Refusal Behavior in Large Language Models: A Nonlinear Perspective

📅 2025-01-14
🤖 AI Summary
This study investigates how large language models (LLMs) refuse harmful requests and challenges the common assumption that refusal behavior is linearly separable. Method: Using dimensionality reduction techniques, both linear (PCA) and nonlinear (t-SNE and UMAP), the study analyzes hidden states across the layers of six LLMs from three architectural families. Contribution/Results: The results indicate that refusal is not well described by a low-dimensional linear boundary; instead, refusal-related representations appear nonlinear and multidimensional, with structure that depends on the model architecture and the layer. The work argues that nonlinear interpretability is needed to advance alignment research and to inform safer AI deployment.

📝 Abstract
Refusal behavior in large language models (LLMs) enables them to decline to respond to harmful, unethical, or inappropriate prompts, supporting alignment with ethical standards. This paper investigates refusal behavior across six LLMs from three architectural families. We challenge the assumption that refusal is a linear phenomenon by applying dimensionality reduction techniques, including PCA, t-SNE, and UMAP, to the models' hidden states. Our results reveal that refusal mechanisms exhibit nonlinear, multidimensional characteristics that vary by model architecture and layer. These findings highlight the need for nonlinear interpretability to improve alignment research and inform safer AI deployment strategies.
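As a rough illustration of the kind of analysis the abstract describes, the sketch below projects hidden-state vectors to two dimensions with PCA and checks whether a linear boundary separates the two classes. It is not the authors' code or data. The "hidden states" are synthetic (a hypothetical "complied" cluster surrounded by a hypothetical "refused" ring, embedded in 64 dimensions), and every function name and parameter here is an assumption made for illustration only.

```python
import numpy as np


def pca_project(hidden_states, n_components=2):
    """Project the rows of `hidden_states` onto their top principal components."""
    centered = hidden_states - hidden_states.mean(axis=0)
    # SVD of the centered data: the rows of `vt` are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def linear_probe_accuracy(points, labels):
    """Training accuracy of a least-squares linear classifier with a bias term."""
    design = np.hstack([points, np.ones((len(points), 1))])
    targets = 2.0 * labels - 1.0  # map {0, 1} to {-1, +1}
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    predictions = (design @ weights > 0).astype(int)
    return float((predictions == labels).mean())


def make_synthetic_states(n_per_class=200, dim=64, seed=0):
    """Synthetic stand-in for hidden states (NOT real model activations).

    Class 0 ("complied") is a small blob; class 1 ("refused") is a ring
    around it. Both live in a random 2-D plane inside a `dim`-D space.
    """
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, n_per_class)
    ring = 3.0 * np.column_stack([np.cos(angles), np.sin(angles)])
    ring += rng.normal(scale=0.1, size=ring.shape)
    blob = rng.normal(scale=0.3, size=(n_per_class, 2))
    planar = np.vstack([blob, ring])
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    # Random orthonormal basis that embeds the plane in the larger space.
    basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    noise = rng.normal(scale=0.01, size=(2 * n_per_class, dim))
    return planar @ basis.T + noise, labels


states, labels = make_synthetic_states()
projected = pca_project(states)

# A linear boundary in the PCA plane cannot isolate the inner blob.
linear_acc = linear_probe_accuracy(projected, labels)

# A nonlinear feature (distance from the center) separates the classes.
radius = np.linalg.norm(projected, axis=1)
radial_acc = float(((radius > 1.5).astype(int) == labels).mean())

print(f"linear probe accuracy: {linear_acc:.2f}")
print(f"radial rule accuracy:  {radial_acc:.2f}")
```

On real activations, the same projection step could be swapped for `sklearn.manifold.TSNE` or `umap.UMAP` (from the umap-learn package) and repeated per layer. Those embeddings distort global distances, so a separability check is more reliably run on the original hidden states than on the 2-D embedding.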
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Safety
Decision Mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Decision Mechanisms
Advanced Mathematical Techniques