Mechanistic Interpretability of Emotion Inference in Large Language Models

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
The emotional reasoning mechanisms of large language models (LLMs) remain poorly understood, particularly how they infer human emotions from text. Method: We propose a framework informed by cognitive appraisal theory that integrates functional localization and causal intervention: (1) representational analysis to identify neurons encoding appraisal dimensions (e.g., responsibility attribution, goal congruence); (2) the first fine-grained causal intervention on emotion concepts, directly manipulating appraisal variables to modulate emotion generation; and (3) validation across diverse model families to assess robustness and psychological interpretability. Results: Emotional representations generalize strongly across models; generated outputs align with the predictions of cognitive appraisal theory; and intervention significantly improves controllability and safety in emotionally sensitive contexts. This work establishes a theoretically grounded, mechanistically interpretable approach to emotion modeling in LLMs.
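
To make the localization step concrete, here is a minimal sketch of how one might probe an autoregressive LM's hidden states for an appraisal dimension. This is not the authors' code: the model name, the two toy stimuli, the goal-congruence labels, and the single probed layer are all illustrative placeholders; a real analysis would sweep layers and use a labeled appraisal dataset.

```python
# Hedged sketch: linear probing of hidden states for one appraisal
# dimension (goal congruence). Everything below the imports is a
# placeholder assumption, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # stand-in; the paper evaluates several model families
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Toy stimuli labeled for goal congruence (1 = goal-congruent event)
texts = ["I finally passed the exam.", "My flight was cancelled again."]
labels = [1, 0]

feats = []
with torch.no_grad():
    for t in texts:
        out = model(**tok(t, return_tensors="pt"))
        # Last-token hidden state at one illustrative layer; the real
        # analysis would compare probes across all layers/modules to
        # locate where the appraisal is encoded.
        feats.append(out.hidden_states[6][0, -1].numpy())

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
```

High probe accuracy concentrated at a few layers, rather than spread uniformly, is the kind of evidence that would support the functional-localization claim.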

📝 Abstract
Large language models (LLMs) show promising capabilities in predicting human emotions from text. However, the mechanisms through which these models process emotional stimuli remain largely unexplored. Our study addresses this gap by investigating how autoregressive LLMs infer emotions, showing that emotion representations are functionally localized to specific regions in the model. Our evaluation includes diverse model families and sizes and is supported by robustness checks. We then show that the identified representations are psychologically plausible by drawing on cognitive appraisal theory, a well-established psychological framework positing that emotions emerge from evaluations (appraisals) of environmental stimuli. By causally intervening on construed appraisal concepts, we steer the generation and show that the outputs align with theoretical and intuitive expectations. This work highlights a novel way to causally intervene and precisely shape emotional text generation, potentially benefiting safety and alignment in sensitive affective domains.
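
The causal intervention described in the abstract can be pictured as shifting a layer's residual stream along a learned appraisal direction during generation. The sketch below shows one common way to implement such steering with a forward hook; it is an assumption-laden illustration, not the paper's method: the layer index, strength `alpha`, prompt, and GPT-2 module path are hypothetical, and the random unit vector stands in for a direction that would actually be derived from a probe or difference-of-means.

```python
# Hedged sketch of steering generation along an appraisal direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; swap in any autoregressive LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx, alpha = 6, 4.0  # hypothetical layer and steering strength
# A real appraisal direction (e.g., other-responsibility) would come
# from a trained probe; a random unit vector is a placeholder here.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steer(module, inputs, output):
    # Add the direction to the block's residual-stream output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# model.transformer.h is GPT-2-specific; other families name blocks differently.
handle = model.transformer.h[layer_idx].register_forward_hook(steer)
prompt = "When my colleague took credit for my work, I felt"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
handle.remove()  # restore the unmodified model
```

Under appraisal theory, pushing the same event description toward high other-responsibility should shift completions toward anger rather than sadness; comparing steered and unsteered generations is how such theoretical predictions would be checked.
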
Problem

Research questions and friction points this paper is trying to address.

Mechanisms of emotion inference in LLMs
Functional localization of emotion representations
Causal intervention in emotional text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Localized emotion representations in LLMs
Causal intervention on appraisal concepts
Robustness checks across model families