Refusal Behavior in Large Language Models: A Nonlinear Perspective

📅 2025-01-14

🤖 AI Summary

This study investigates how large language models (LLMs) refuse harmful requests and challenges the common assumption that refusal behavior is linearly separable in hidden-state space. Method: Using dimensionality reduction techniques, both linear (PCA) and nonlinear (t-SNE, UMAP), we analyze hidden states across the layers of six LLMs from three architectural families. Contribution/Results: The projections indicate that refusal is not captured by a single low-dimensional linear boundary; instead, it shows nonlinear, multidimensional structure that varies by model architecture and by layer. We argue that nonlinear interpretability methods are needed to improve alignment research and to inform safer deployment of refusal mechanisms.

📝 Abstract

Refusal behavior in large language models (LLMs) enables them to decline to respond to harmful, unethical, or inappropriate prompts, supporting alignment with ethical standards. This paper investigates refusal behavior across six LLMs from three architectural families. We challenge the assumption that refusal is a linear phenomenon by applying dimensionality reduction techniques, including PCA, t-SNE, and UMAP, to the models' hidden states. Our results reveal that refusal mechanisms exhibit nonlinear, multidimensional characteristics that vary by model architecture and layer. These findings highlight the need for nonlinear interpretability to improve alignment research and inform safer AI deployment strategies.
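To make the linear-versus-nonlinear distinction concrete, here is a minimal sketch on synthetic data. It is not the paper's pipeline: the "hidden states" are random vectors, the function names (`pca_project`, `fit_mean_difference`, `linear_accuracy`, `synthetic_layer`) are hypothetical, and the difference-of-means rule is a stand-in for "refusal as a single linear direction". PCA is computed directly with an SVD; t-SNE and UMAP are left out to keep the example dependency-free.

```python
import numpy as np

def pca_project(hidden, n_components=2):
    """Project rows of `hidden` onto their top principal components via SVD."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return centered @ vt[:n_components].T, explained[:n_components]

def fit_mean_difference(hidden, labels):
    """Fit a one-direction linear rule: the difference of the two class means."""
    mu_refuse = hidden[labels == 1].mean(axis=0)
    mu_comply = hidden[labels == 0].mean(axis=0)
    return mu_refuse - mu_comply, (mu_refuse + mu_comply) / 2

def linear_accuracy(direction, midpoint, hidden, labels):
    """Classify by which side of the midpoint a point falls on along `direction`."""
    predicted = ((hidden - midpoint) @ direction > 0).astype(int)
    return float(np.mean(predicted == labels))

def synthetic_layer(rng, n_per_class=200, dim=16, ring=False):
    """Synthetic stand-in for one layer's hidden states (assumption, not real data).

    ring=False: class 1 is shifted along one axis (linearly separable).
    ring=True:  class 1 lies on a ring around class 0 (not linearly separable).
    """
    labels = np.repeat([0, 1], n_per_class)
    hidden = rng.normal(size=(2 * n_per_class, dim))
    if ring:
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_per_class)
        hidden[labels == 1, 0] += 6.0 * np.cos(angles)
        hidden[labels == 1, 1] += 6.0 * np.sin(angles)
    else:
        hidden[labels == 1, 0] += 4.0
    return hidden, labels

rng = np.random.default_rng(0)
results = {}
for name, ring in [("shifted", False), ("ring", True)]:
    train_x, train_y = synthetic_layer(rng, ring=ring)
    test_x, test_y = synthetic_layer(rng, ring=ring)
    # Fit the direction on one sample and score it on a fresh one.
    direction, midpoint = fit_mean_difference(train_x, train_y)
    results[name] = linear_accuracy(direction, midpoint, test_x, test_y)
    projected, explained = pca_project(test_x)
    print(name, "held-out linear accuracy:", round(results[name], 3),
          "| PCA projection shape:", projected.shape)
```

In the shifted case a single direction separates the two classes well, while in the ring case the same rule does little better than chance even though the classes are clearly distinct. That gap is the kind of structure a nonlinear projection such as t-SNE or UMAP can expose and a purely linear view can miss.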

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Safety

Decision Mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

Decision Mechanisms

Advanced Mathematical Techniques
