🤖 AI Summary
This work addresses the limitation of current large language models, whose moral alignment often relies on superficial constraints and fails to effectively modulate their internal moral representations. Grounded in Moral Foundations Theory, the paper proposes an Adaptive Moral Fusion (AMF) mechanism that leverages cross-lingual linear probing to analyze mid-layer model representations. This approach reveals, for the first time, a shared yet distinct moral subspace between English and Chinese, from which manipulable moral vectors are extracted. During inference, AMF dynamically integrates probe-based moral detection with targeted vector injection to enable real-time intervention in the model’s intrinsic moral reasoning pathways. Experiments demonstrate that the method significantly reduces false rejection rates on benign queries while effectively suppressing jailbreak attack success, outperforming standard baselines.
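The paper itself does not include code here, but the probing step described above can be sketched as a toy experiment. Everything below is an illustrative assumption, not the authors' implementation: the "activations" are synthetic stand-ins for mid-layer hidden states (a real pipeline would extract them from an LLM), and the names `simulate`, `moral_dir`, and `moral_vector` are hypothetical. The sketch fits a linear probe on English-labeled activations only, then checks that it transfers to the "Chinese" activations, consistent with a shared moral subspace with a language-specific offset.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # stand-in hidden-state dimensionality

# Hypothetical geometry: both languages share one moral direction, plus a
# language-specific offset kept orthogonal to it for this toy demo.
moral_dir = rng.normal(size=d)
moral_dir /= np.linalg.norm(moral_dir)
offset = rng.normal(size=d)
offset -= (offset @ moral_dir) * moral_dir   # remove moral component
offset *= 0.5 / np.linalg.norm(offset)

def simulate(n, lang_offset):
    """Synthetic mid-layer activations with binary moral labels."""
    y = rng.integers(0, 2, n)                      # 1 = moral, 0 = immoral
    X = rng.normal(scale=0.3, size=(n, d))         # background noise
    X += np.outer(2 * y - 1, moral_dir) + lang_offset
    return X, y

X_en, y_en = simulate(400, 0.0)
X_zh, y_zh = simulate(400, offset)

# Linear probe: least-squares fit (weights + bias) on English data only.
A = np.hstack([X_en, np.ones((len(X_en), 1))])
w, *_ = np.linalg.lstsq(A, 2 * y_en - 1, rcond=None)

def predict(X):
    return (np.hstack([X, np.ones((len(X), 1))]) @ w > 0).astype(int)

# Cross-lingual transfer: the English-trained probe also separates the
# Chinese activations despite the language offset.
acc_zh = (predict(X_zh) == y_zh).mean()
print(f"zh transfer accuracy: {acc_zh:.2f}")

# The probe's weight direction serves as a steerable "moral vector".
moral_vector = w[:-1] / np.linalg.norm(w[:-1])
print(f"cosine with true moral direction: {abs(moral_vector @ moral_dir):.2f}")
```

The probe succeeding on held-out Chinese data despite never seeing it mirrors the paper's "shared yet distinct" finding: the moral direction is common, while each language contributes its own offset within the subspace.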
📝 Abstract
Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we show that moral representations in the middle layers are shared across languages, and we uncover a shared yet distinct moral subspace between English and Chinese. Building on this, we extract steerable Moral Vectors and validate their efficacy at both the internal and behavioral levels. Leveraging the strong generalizability of these moral representations, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe-based detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, reducing incorrect refusals on benign queries while minimizing jailbreak success rates relative to standard baselines.
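The "synergize probe detection with vector injection" step of AMF can be sketched as a minimal gating function, again under stated assumptions rather than as the authors' implementation: `probe_w`/`probe_b` and `moral_vector` stand in for quantities extracted from a real model's middle layers, `h` stands in for hidden states captured there during inference, and `ALPHA` and `THRESHOLD` are hypothetical hyperparameters. The key idea is that steering is applied only where the probe flags moral risk, which is how the mechanism can reduce false refusals on benign inputs while still countering jailbreaks.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # stand-in hidden-state dimensionality

# Hypothetical stand-ins: in AMF these would be the trained probe and the
# extracted Moral Vector; here they are random for illustration.
moral_vector = rng.normal(size=d)
moral_vector /= np.linalg.norm(moral_vector)
probe_w, probe_b = moral_vector, 0.0

ALPHA = 4.0       # injection strength (assumed hyperparameter)
THRESHOLD = 0.5   # probe-score gate for adaptive intervention

def amf_intervene(h):
    """Probe each hidden state; steer only those flagged as morally risky."""
    score = 1.0 / (1.0 + np.exp(-(h @ probe_w + probe_b)))  # moral probability
    gate = (score < THRESHOLD)[:, None]    # low score -> intervene
    return h + gate * ALPHA * moral_vector

h = rng.normal(size=(5, d))        # stand-in mid-layer hidden states
h_steered = amf_intervene(h)

# Wherever the gate fired, the activation shifts along the moral vector;
# benign (high-score) states pass through unchanged.
shift = (h_steered - h) @ moral_vector
print(shift)
```

In a real deployment this function would run inside a forward hook on a chosen middle layer, so the intervention happens at inference time without retraining; the gate is what makes the fusion adaptive rather than a blanket steering of every input.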