🤖 AI Summary
Existing vision-language models lack the capability to model the moral semantics of images and text. Method: We propose the first vision-language contrastive alignment model grounded in Moral Foundations Theory (MFT), explicitly integrating MFT into multimodal representation learning so that images and text are jointly embedded and aligned within a unified moral semantic space. We construct a fine-grained, MFT-annotated dataset of 15,000 image-text pairs and design a morality-aware data augmentation strategy alongside a multi-label contrastive learning framework, trained with supervision from the Social-Moral Image Database. Contribution/Results: Experiments demonstrate substantial improvements in both unimodal moral classification and cross-modal moral semantic matching, establishing a novel paradigm and technical foundation for interpretable, value-aligned multimodal AI systems.
📝 Abstract
Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoders lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded in the multi-label Social-Moral Image Database, which annotates co-occurring moral foundations in visual content. To train MoralCLIP, we design a moral data augmentation strategy that scales our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally aware AI systems capable of recognizing and aligning with human moral values.
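The multi-label contrastive objective described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual implementation: the function name, the rule that image-text pairs sharing at least one moral-foundation label count as positives, and the temperature value are all assumptions.

```python
import numpy as np

def multilabel_contrastive_loss(img_emb, txt_emb, labels, temperature=0.07):
    """Sketch of a multi-label contrastive loss (assumed formulation):
    image-text pairs sharing >= 1 moral-foundation label are positives.
    `labels` is an (N, K) binary matrix over K moral foundations."""
    # L2-normalize embeddings so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    # Positive mask: pairs sharing at least one moral-foundation label
    pos = (labels @ labels.T > 0).astype(float)  # (N, N)
    # Row-wise log-softmax over all candidate texts for each image anchor
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability over each anchor's positives
    loss = -(pos * log_prob).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return loss.mean()
```

Compared with standard CLIP, where only the matching caption of each image is a positive, this variant pulls together any image-text pair whose moral-foundation label sets overlap, which is one plausible way to shape a shared moral semantic space.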