🤖 AI Summary
This work introduces aligned probing, an interpretability framework that aligns the output behavior of language models (LMs) with their layer-wise internal representations, and applies it to toxicity across more than 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for the first time. The analysis shows that: (i) LMs strongly encode the toxicity level of inputs and subsequent outputs, particularly in lower layers; (ii) correlative and causal evidence indicates that models which strongly encode input toxicity generate less toxic output; and (iii) toxicity is heterogeneous, with behavior and internals varying across individual attributes such as Threat. Four case studies, covering detoxification, multi-prompt evaluation, model quantization, and pre-training dynamics, demonstrate the framework's practical utility and contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.
📝 Abstract
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), as expressed in their outputs, with their internal representations (internals). Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives on toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Focusing on how individual LMs differ offers both correlative and causal evidence that they generate less toxic output when they strongly encode information about input toxicity. We also highlight the heterogeneity of toxicity, as model behavior and internals vary across individual attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluation, model quantization, and pre-training dynamics underline the practical impact of aligned probing and yield further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.
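The core mechanism the abstract describes, training probes on per-layer representations to read off toxicity information, can be illustrated with a minimal sketch. This is not the paper's implementation: the hidden states below are synthetic stand-ins for real LM activations, and the assumption that the toxicity signal weakens with depth is injected by hand purely to mimic the reported lower-layer finding. All names (`probe_accuracy`, `signal_strength`) are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-layer hidden states. In the real setting these
# would be activations extracted from an LM for toxic/non-toxic inputs.
n_samples, hidden_dim, n_layers = 400, 32, 4
toxicity = rng.integers(0, 2, n_samples)  # binary toxicity label per input

def probe_accuracy(layer: int) -> float:
    """Train a linear probe on one layer's (synthetic) representations."""
    # Assumption for illustration only: the toxicity signal decays with depth.
    signal_strength = 2.0 / (layer + 1)
    X = rng.normal(size=(n_samples, hidden_dim))
    X[:, 0] += signal_strength * toxicity  # inject signal into one dimension
    X_tr, X_te, y_tr, y_te = train_test_split(X, toxicity, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

accs = {layer: probe_accuracy(layer) for layer in range(n_layers)}
print(accs)  # probe accuracy per layer; higher = more toxicity information
```

With real activations, one would replace the synthetic `X` with hidden states from each transformer layer; comparing probe accuracy across layers then gives the kind of layer-wise toxicity-encoding profile the paper analyzes.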