Understanding the Mechanism of Altruism in Large Language Models

📅 2026-04-21

🤖 AI Summary

This study investigates the internal computations underlying altruistic behavior in large language models. Using sparse autoencoders (SAEs), it contrasts model activations on minimal-pair Dictator Game prompts that differ only in social stance (generous versus selfish) and identifies a small set of interpretable, intervenable features associated with the resulting shift in allocations. Drawing on dual-process theory, it classifies a subset of these features as primarily heuristic (System 1) or primarily deliberative (System 2). Causal interventions (activation patching and continuous steering) on this set, which makes up just 0.024% of all SAE features across the model's layers, reliably shift allocation distributions, with System 2 features exerting a more proximal influence on the final output than System 1 features. The same steering direction generalizes across multiple social-preference games.

📝 Abstract

Altruism is fundamental to human societies, fostering cooperation and social cohesion. Recent studies suggest that large language models (LLMs) can display human-like prosocial behavior, but the internal computations that produce such behavior remain poorly understood. We investigate the mechanisms underlying LLM altruism using sparse autoencoders (SAEs). In a standard Dictator Game, minimal-pair prompts that differ only in social stance (generous versus selfish) induce large, economically meaningful shifts in allocations. Leveraging this contrast, we identify a set of SAE features (0.024% of all features across the model's layers) whose activations are strongly associated with the behavioral shift. To interpret these features, we use benchmark tasks motivated by dual-process theories to classify a subset as primarily heuristic (System 1) or primarily deliberative (System 2). Causal interventions validate their functional role: activation patching and continuous steering of this feature direction reliably shift allocation distributions, with System 2 features exerting a more proximal influence on the model's final output than System 1 features. The same steering direction generalizes across multiple social-preference games. Together, these results enhance our understanding of artificial cognition by translating altruistic behaviors into identifiable network states and provide a framework for aligning LLM behavior with human values, thereby informing more transparent and value-aligned deployment.
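To make the contrast-then-steer idea in the abstract concrete, here is a minimal sketch on synthetic data. It is not the authors' implementation: the function names, the mean-difference ranking criterion, the decoder-weighted direction, and all array shapes are assumptions chosen for illustration. The abstract states only that features are selected by their association with the generous-versus-selfish shift and that a feature direction is steered continuously.

```python
import numpy as np

def select_contrast_features(acts_generous, acts_selfish, top_fraction):
    """Rank SAE features by the absolute difference in mean activation
    between the two prompt conditions (an assumed selection criterion)."""
    diff = acts_generous.mean(axis=0) - acts_selfish.mean(axis=0)
    k = max(1, int(round(top_fraction * diff.shape[0])))
    top = np.argsort(-np.abs(diff))[:k]
    return top, diff

def steering_direction(decoder, feature_idx, diff):
    """Combine the decoder rows of the selected features, weighted by
    their contrast, into a single unit-norm direction (assumed recipe)."""
    direction = diff[feature_idx] @ decoder[feature_idx]
    return direction / np.linalg.norm(direction)

def steer(hidden, direction, alpha):
    """Continuous steering: shift a hidden state along the direction."""
    return hidden + alpha * direction

# Synthetic stand-ins for SAE activations; no real model is involved.
rng = np.random.default_rng(0)
n_prompts, n_features, d_model = 32, 5000, 16
acts_selfish = rng.random((n_prompts, n_features)) * 0.1
acts_generous = acts_selfish.copy()
acts_generous[:, [7, 42]] += 2.0  # plant two "altruism" features

# Hypothetical SAE decoder matrix: one d_model-sized row per feature.
decoder = rng.standard_normal((n_features, d_model))

top, diff = select_contrast_features(acts_generous, acts_selfish, 0.0004)
direction = steering_direction(decoder, top, diff)
steered = steer(np.zeros(d_model), direction, alpha=3.0)
```

In a real setting, the activations would come from an SAE applied to a model's hidden states on the paired prompts, and `alpha` would be swept over a range to observe how the allocation distribution changes.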

Problem

Research questions and friction points this paper is trying to address.

altruism

large language models

mechanism

prosocial behavior

artificial cognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse autoencoders

altruism

dual-process theory