🤖 AI Summary
This work addresses the limitation of current large language models, whose moral alignment often relies on superficial constraints and fails to effectively modulate their internal moral representations. Grounded in Moral Foundations Theory, the paper proposes an Adaptive Moral Fusion (AMF) mechanism that leverages cross-lingual linear probing to analyze mid-layer model representations. This approach reveals, for the first time, a shared yet distinct moral subspace between English and Chinese, from which manipulable moral vectors are extracted. During inference, AMF dynamically integrates probe-based moral detection with targeted vector injection to enable real-time intervention in the model’s intrinsic moral reasoning pathways. Experiments demonstrate that the method significantly reduces false rejection rates on benign queries while effectively suppressing jailbreak attack success, outperforming standard baselines.
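The paper itself does not include code here, but the probing step described above can be sketched as a toy experiment. Everything below is an illustrative assumption, not the authors' implementation: the "activations" are synthetic stand-ins for mid-layer hidden states (a real pipeline would extract them from an LLM), and the names `simulate`, `moral_dir`, and `moral_vector` are hypothetical. The sketch fits a linear probe on English-labeled activations only, then checks that it transfers to the "Chinese" activations, consistent with a shared moral subspace with a language-specific offset.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # stand-in hidden-state dimensionality

# Hypothetical geometry: both languages share one moral direction, plus a
# language-specific offset kept orthogonal to it for this toy demo.
moral_dir = rng.normal(size=d)
moral_dir /= np.linalg.norm(moral_dir)
offset = rng.normal(size=d)
offset -= (offset @ moral_dir) * moral_dir   # remove moral component
offset *= 0.5 / np.linalg.norm(offset)

def simulate(n, lang_offset):
    """Synthetic mid-layer activations with binary moral labels."""
    y = rng.integers(0, 2, n)                      # 1 = moral, 0 = immoral
    X = rng.normal(scale=0.3, size=(n, d))         # background noise
    X += np.outer(2 * y - 1, moral_dir) + lang_offset
    return X, y

X_en, y_en = simulate(400, 0.0)
X_zh, y_zh = simulate(400, offset)

# Linear probe: least-squares fit (weights + bias) on English data only.
A = np.hstack([X_en, np.ones((len(X_en), 1))])
w, *_ = np.linalg.lstsq(A, 2 * y_en - 1, rcond=None)

def predict(X):
    return (np.hstack([X, np.ones((len(X), 1))]) @ w > 0).astype(int)

# Cross-lingual transfer: the English-trained probe also separates the
# Chinese activations despite the language offset.
acc_zh = (predict(X_zh) == y_zh).mean()
print(f"zh transfer accuracy: {acc_zh:.2f}")

# The probe's weight direction serves as a steerable "moral vector".
moral_vector = w[:-1] / np.linalg.norm(w[:-1])
print(f"cosine with true moral direction: {abs(moral_vector @ moral_dir):.2f}")
```

The probe succeeding on held-out Chinese data despite never seeing it mirrors the paper's "shared yet distinct" finding: the moral direction is common, while each language contributes its own offset within the subspace.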
📝 Abstract
Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we show that moral representations in the middle layers are shared across languages, and we uncover a shared yet distinct moral subspace between English and Chinese. Building on this, we extract steerable Moral Vectors and validate their efficacy at both the internal and behavioral levels. Leveraging the strong generalizability of these moral representations, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe-based detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, reducing incorrect refusals on benign queries while minimizing jailbreak success rates relative to standard baselines.
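The "synergize probe detection with vector injection" step of AMF can be sketched as a minimal gating function, again under stated assumptions rather than as the authors' implementation: `probe_w`/`probe_b` and `moral_vector` stand in for quantities extracted from a real model's middle layers, `h` stands in for hidden states captured there during inference, and `ALPHA` and `THRESHOLD` are hypothetical hyperparameters. The key idea is that steering is applied only where the probe flags moral risk, which is how the mechanism can reduce false refusals on benign inputs while still countering jailbreaks.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # stand-in hidden-state dimensionality

# Hypothetical stand-ins: in AMF these would be the trained probe and the
# extracted Moral Vector; here they are random for illustration.
moral_vector = rng.normal(size=d)
moral_vector /= np.linalg.norm(moral_vector)
probe_w, probe_b = moral_vector, 0.0

ALPHA = 4.0       # injection strength (assumed hyperparameter)
THRESHOLD = 0.5   # probe-score gate for adaptive intervention

def amf_intervene(h):
    """Probe each hidden state; steer only those flagged as morally risky."""
    score = 1.0 / (1.0 + np.exp(-(h @ probe_w + probe_b)))  # moral probability
    gate = (score < THRESHOLD)[:, None]    # low score -> intervene
    return h + gate * ALPHA * moral_vector

h = rng.normal(size=(5, d))        # stand-in mid-layer hidden states
h_steered = amf_intervene(h)

# Wherever the gate fired, the activation shifts along the moral vector;
# benign (high-score) states pass through unchanged.
shift = (h_steered - h) @ moral_vector
print(shift)
```

In a real deployment this function would run inside a forward hook on a chosen middle layer, so the intervention happens at inference time without retraining; the gate is what makes the fusion adaptive rather than a blanket steering of every input.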