The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of current large language models: their moral alignment often rests on superficial constraints that fail to modulate the models' internal moral representations. Grounded in Moral Foundations Theory, the paper proposes an Adaptive Moral Fusion (AMF) mechanism that uses cross-lingual linear probing to analyze mid-layer model representations. This analysis reveals, for the first time, a shared yet distinct moral subspace between English and Chinese, from which manipulable moral vectors are extracted. During inference, AMF dynamically combines probe-based moral detection with targeted vector injection, enabling real-time intervention in the model's intrinsic moral reasoning pathways. Experiments show that the method significantly reduces false-rejection rates on benign queries while suppressing jailbreak attacks, outperforming standard baselines.
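To make the probing step concrete, here is a minimal sketch of what cross-lingual linear probing over mid-layer hidden states could look like. The model name, probe layer, and the tiny paired English/Chinese prompt sets are illustrative assumptions, not the paper's configuration; the actual probes are presumably fit per MFT foundation on much larger labeled data.

```python
# Hypothetical sketch of cross-lingual linear probing on mid-layer
# hidden states. Model name, layer index, and the toy English/Chinese
# prompt pairs are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "Qwen/Qwen2-0.5B-Instruct"  # assumption: any open bilingual chat LLM
LAYER = 12                          # assumption: a middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_state(prompt: str, layer: int = LAYER) -> torch.Tensor:
    """Hidden state of the final prompt token at the given layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float()

# Toy labels for one MFT foundation (care/harm); 1 = moral violation.
en = [("Comfort a friend who is grieving.", 0),
      ("Mock a stranger who is crying.", 1)]
zh = [("安慰一位悲伤的朋友。", 0),
      ("嘲笑正在哭泣的陌生人。", 1)]

X_en = torch.stack([last_token_state(p) for p, _ in en]).numpy()
X_zh = torch.stack([last_token_state(p) for p, _ in zh]).numpy()

probe = LogisticRegression(max_iter=1000).fit(X_en, [l for _, l in en])
# Transfer test: an English-trained probe scoring Chinese activations
# above chance is the kind of evidence read as a shared moral subspace.
print("zh accuracy of en-trained probe:", probe.score(X_zh, [l for _, l in zh]))
```

With two examples per language this is only a smoke test; the paper's cross-lingual comparison would rest on far larger labeled sets and on comparing the learned probe directions across languages.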

📝 Abstract
Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet different moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.
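As a rough illustration of the probe-plus-injection pattern described in the abstract, the sketch below (continuing the probing example above, and reusing its model, tok, LAYER, probe, and last_token_state) extracts a moral direction as a difference of means and adds it to the residual stream via a forward hook only when the probe flags a risky prompt. The gating threshold, injection scale, hook placement, and sign of the steering direction are all assumptions; the actual AMF mechanism may gate and fuse vectors differently.

```python
# Hypothetical sketch of probe-gated vector injection in the spirit of
# Adaptive Moral Fusion. Threshold, scale, and layer are illustrative.
import torch

THRESHOLD, SCALE = 0.5, 4.0  # assumed gating level and injection strength

def moral_vector(X_safe: torch.Tensor, X_viol: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction pointing from violating toward safe."""
    v = X_safe.mean(0) - X_viol.mean(0)
    return v / v.norm()

def make_injection_hook(vec: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        # Decoder layers return a tuple whose first element is the
        # residual-stream hidden states; add the scaled moral direction.
        if isinstance(output, tuple):
            return (output[0] + scale * vec.to(output[0].dtype),) + output[1:]
        return output + scale * vec.to(output.dtype)
    return hook

def generate_with_amf(prompt: str, vec: torch.Tensor) -> str:
    # Probe the prompt's mid-layer state; inject only above threshold,
    # so benign queries are generated without any intervention.
    risk = probe.predict_proba(last_token_state(prompt).numpy()[None])[0, 1]
    handle = None
    if risk > THRESHOLD:
        layer = model.model.layers[LAYER]
        handle = layer.register_forward_hook(make_injection_hook(vec, SCALE))
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=128)
        return tok.decode(out[0], skip_special_tokens=True)
    finally:
        if handle is not None:
            handle.remove()

# Toy usage, reusing the probing data: row 0 is safe, row 1 violating.
vec = moral_vector(torch.tensor(X_en[:1]), torch.tensor(X_en[1:]))
print(generate_with_amf("How can I humiliate a coworker?", vec))
```

Gating the injection on the probe score is what targets the safety-helpfulness trade-off: the steering cost is paid only on prompts the probe already judges morally risky.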
Problem

Research questions and friction points this paper is trying to address.

moral alignment
Large Language Models
AI safety
Moral Foundations Theory
safety-helpfulness trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Moral Foundations Theory
Moral Vectors
Adaptive Moral Fusion
cross-lingual linear probing
intrinsic moral alignment
Authors
Luoming Hu, School of Future Technology, Dalian University of Technology, China
Jingjie Zeng, School of Computer Science and Technology, Dalian University of Technology, China
Liang Yang, Dalian University of Technology (NLP)
Hongfei Lin, Dalian University of Technology (natural language processing, sentiment analysis, text mining, social computing)