🤖 AI Summary
Existing vision-language models lack the capability to model the moral semantics of images and text. Method: We propose the first vision-language contrastive alignment model grounded in Moral Foundations Theory (MFT), explicitly integrating MFT into multimodal representation learning so that images and text are jointly embedded and aligned within a unified moral semantic space. We construct a fine-grained, MFT-annotated dataset of 15,000 image-text pairs and design a morality-aware data augmentation strategy alongside a multi-label contrastive learning framework, trained with supervision from the Social-Moral Image Database. Contribution/Results: Experiments demonstrate substantial improvements in both unimodal moral classification and cross-modal moral semantic matching, establishing a novel paradigm and technical foundation for interpretable, value-aligned multimodal AI systems.
📝 Abstract
Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoders lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded in the multi-label Social-Moral Image Database, which annotates co-occurring moral foundations in visual content. To train MoralCLIP, we design a moral data augmentation strategy that scales our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally aware AI systems capable of recognizing and aligning with human moral values.
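The multi-label contrastive objective described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual implementation: the function name, the rule that image-text pairs sharing at least one moral-foundation label count as positives, and the temperature value are all assumptions.

```python
import numpy as np

def multilabel_contrastive_loss(img_emb, txt_emb, labels, temperature=0.07):
    """Sketch of a multi-label contrastive loss (assumed formulation):
    image-text pairs sharing >= 1 moral-foundation label are positives.
    `labels` is an (N, K) binary matrix over K moral foundations."""
    # L2-normalize embeddings so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    # Positive mask: pairs sharing at least one moral-foundation label
    pos = (labels @ labels.T > 0).astype(float)  # (N, N)
    # Row-wise log-softmax over all candidate texts for each image anchor
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability over each anchor's positives
    loss = -(pos * log_prob).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return loss.mean()
```

Compared with standard CLIP, where only the matching caption of each image is a positive, this variant pulls together any image-text pair whose moral-foundation label sets overlap, which is one plausible way to shape a shared moral semantic space.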