Tracing Moral Foundations in Large Language Models

πŸ“… 2026-01-09
πŸ›️ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study investigates whether moral judgments in large language models arise from internal representations or surface-level mimicry, and examines how moral foundations are encoded. Grounded in Moral Foundations Theory, the work integrates hierarchical representational analysis, pretrained sparse autoencoders (SAEs), and causal interventions on residual streams to achieve, for the first time, causal manipulation of moral concepts at both the dense-vector and sparse-feature levels. Results demonstrate that Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct structurally differentiate moral foundations in a manner consistent with human moral cognition. Sparse features exhibit strong associations with specific moral semantics, and targeted interventions predictably alter model outputs, revealing that moral representations are hierarchical, distributed, and partially disentangled.
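To make the dense-vector part of this pipeline concrete, below is a minimal sketch (not the authors' released code) of extracting layer-wise residual-stream representations with HuggingFace `transformers` and building a moral-foundation direction as a difference of class means. The prompt sets, the Care/Harm foundation choice, and the layer index are illustrative assumptions.

```python
# Sketch: layer-wise hidden states + a dense "Care vs. Harm" direction.
# Assumptions: access to the gated Llama checkpoint; tiny toy prompt sets;
# LAYER = 16 is an arbitrary mid-depth choice, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # one of the two models studied
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def layer_states(texts, layer):
    """Mean-pooled hidden states at a given layer for a list of prompts."""
    reps = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; index L is layer L's output.
        h = out.hidden_states[layer][0]     # (seq_len, d_model)
        reps.append(h.mean(dim=0))          # pool over tokens
    return torch.stack(reps)                # (n_prompts, d_model)

# Hypothetical mini prompt sets for one foundation (Care/Harm).
care = ["She comforted the crying child.", "He bandaged the injured bird."]
harm = ["She mocked the crying child.", "He kicked the injured bird."]

LAYER = 16
direction = layer_states(care, LAYER).mean(0) - layer_states(harm, LAYER).mean(0)
direction = direction / direction.norm()     # unit-norm dense MFT vector
```

Repeating this extraction across all layers and correlating the resulting geometry with human moral-similarity judgments is the kind of layer-wise alignment analysis the summary describes.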

πŸ“ Abstract
Large language models (LLMs) often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed within two instruction-tuned LLMs: Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that both models represent and distinguish moral foundations in a structured, layer-dependent way that aligns with human judgments. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.
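The causal steering in (iii) can be sketched with a forward hook that adds a scaled foundation vector to the residual stream at one layer during generation. This reuses `direction`, `LAYER`, `model`, and `tok` from the sketch above; the steering scale and test prompt are illustrative assumptions, not values from the paper.

```python
# Sketch: dense-vector steering via a residual-stream hook.
# ALPHA is a hypothetical steering strength; tune per layer/model.
ALPHA = 4.0

def steer_hook(module, inputs, output):
    # Llama decoder layers return a tuple; element 0 is the hidden states.
    # layers[LAYER - 1]'s output corresponds to hidden_states[LAYER] above.
    hidden = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER - 1].register_forward_hook(steer_hook)
try:
    ids = tok("Is it acceptable to ignore a stranger in distress?",
              return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=60, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls run unsteered
```

Comparing steered and unsteered completions on foundation-relevant prompts is what lets the authors claim a causal, not merely correlational, link between the representation and the model's moral outputs.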
Problem

Research questions and friction points this paper is trying to address.

moral foundations
large language models
moral mimicry
conceptual structure
Moral Foundations Theory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Moral Foundations Theory
Sparse Autoencoders (sketched after this list)
Causal Steering
Layer-wise Analysis
Disentangled Representations
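For the sparse-feature side, here is a self-contained sketch of an SAE-level intervention in the spirit of the abstract's method (ii)/(iii). The SAE weights below are randomly initialized stand-ins (the paper uses pretrained SAEs over the residual stream), and `FEATURE_ID`, the clamp value, and the layer sizes are hypothetical.

```python
# Sketch: clamping or ablating one sparse feature and patching the
# reconstruction delta back into the residual stream.
import torch
import torch.nn as nn

D_MODEL, D_SAE = 4096, 32768   # illustrative sizes for an 8B model's stream

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))   # nonnegative, sparse feature activations

    def decode(self, f):
        return self.dec(f)

sae = SparseAutoencoder(D_MODEL, D_SAE)   # stand-in for a pretrained SAE
resid = torch.randn(D_MODEL)              # stand-in residual-stream state

FEATURE_ID = 1234                         # hypothetical foundation-linked feature
feats = sae.encode(resid)
steered = feats.clone()
steered[FEATURE_ID] = 10.0                # clamp the feature on (set to 0 to ablate)

# Patch only the reconstruction delta, so the intervention changes just
# what the edited feature explains and leaves the SAE's error term intact.
resid_steered = resid + (sae.decode(steered) - sae.decode(feats))
```

Performed at the layer where a feature is most active, this kind of single-feature edit is what allows the fine-grained, partially disentangled interventions the summary highlights.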