🤖 AI Summary
This study investigates the internal computational mechanisms underlying altruistic behavior in large language models (LLMs). Applying sparse autoencoders (SAEs) to model activations on Dictator Game prompts that differ only in social stance (generous versus selfish), the work identifies interpretable, intervenable features associated with altruistic allocations. Drawing on dual-process theory, it classifies a subset of these features as primarily heuristic (System 1) or primarily deliberative (System 2). Intervening on the identified features, which make up 0.024% of all SAE features across the model's layers, reliably shifts allocation behavior, with System 2 features exerting a more proximal influence on the model's final output than System 1 features. The same steering direction also generalizes across multiple social-preference games, suggesting potential for targeted behavioral control.
📝 Abstract
Altruism is fundamental to human societies, fostering cooperation and social cohesion. Recent studies suggest that large language models (LLMs) can display human-like prosocial behavior, but the internal computations that produce such behavior remain poorly understood. We investigate the mechanisms underlying LLM altruism using sparse autoencoders (SAEs). In a standard Dictator Game, minimal-pair prompts that differ only in social stance (generous versus selfish) induce large, economically meaningful shifts in allocations. Leveraging this contrast, we identify a set of SAE features (0.024% of all features across the model's layers) whose activations are strongly associated with the behavioral shift. To interpret these features, we use benchmark tasks motivated by dual-process theories to classify a subset as primarily heuristic (System 1) or primarily deliberative (System 2). Causal interventions validate their functional role: activation patching and continuous steering along the direction defined by these features reliably shift allocation distributions, with System 2 features exerting a more proximal influence on the model's final output than System 1 features. The same steering direction generalizes across multiple social-preference games. Together, these results advance our understanding of artificial cognition by linking altruistic behaviors to identifiable network states, and they provide a framework for aligning LLM behavior with human values, thereby informing more transparent and value-aligned deployment.
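The contrast-then-steer idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the selection rule (absolute difference in mean activation), the way the steering direction is built (a contrast-weighted sum of decoder rows), the function names, and the toy data are all assumptions made for illustration. The abstract does not specify any of these details.

```python
import numpy as np

def select_contrast_features(acts_generous, acts_selfish, top_fraction):
    """Rank SAE features by the absolute difference in mean activation
    between the two prompt sets and keep the top fraction.
    (Hypothetical selection rule; the paper's criterion may differ.)"""
    diff = acts_generous.mean(axis=0) - acts_selfish.mean(axis=0)
    k = max(1, int(round(top_fraction * diff.size)))
    top = np.argsort(-np.abs(diff))[:k]
    return np.sort(top), diff

def steering_direction(decoder, feature_ids, diff):
    """Combine the decoder rows of the selected features, weighted by
    their contrast, into one unit-norm direction in activation space."""
    direction = (diff[feature_ids, None] * decoder[feature_ids]).sum(axis=0)
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 0 else direction

def steer(hidden, direction, alpha):
    """Continuous steering: shift a hidden state along the direction."""
    return hidden + alpha * direction

# Toy data standing in for real SAE activations (assumed shapes).
rng = np.random.default_rng(0)
n_prompts, n_features, d_model = 8, 1000, 16
acts_selfish = rng.random((n_prompts, n_features)) * 0.1
acts_generous = acts_selfish.copy()
acts_generous[:, [3, 7]] += 1.0  # only features 3 and 7 respond to stance

decoder = rng.standard_normal((n_features, d_model))  # hypothetical SAE decoder
feature_ids, diff = select_contrast_features(acts_generous, acts_selfish, 0.002)
direction = steering_direction(decoder, feature_ids, diff)
steered = steer(np.zeros(d_model), direction, alpha=2.0)
```

In this toy setup, sweeping `alpha` over positive and negative values plays the role of continuous steering toward or away from the generous stance. In a real experiment the shift would be applied to the model's residual stream and the resulting allocation distribution measured.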