MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing vision-language models struggle to perform moral reasoning aligned with human judgment in multimodal and socially ambiguous contexts. To this end, we introduce MM-SCALE, a large-scale multimodal dataset for moral alignment that uniquely incorporates 5-point scalar ratings and an explicit modality alignment mechanism. Using a custom annotation interface, we collect fine-grained moral acceptability scores and justifications for image-scenario pairs. Leveraging this dataset, we replace conventional binary or pairwise supervision with scalar supervision and integrate listwise preference optimization into model fine-tuning, yielding more continuous and nuanced moral alignment. Experimental results demonstrate that our fine-tuned models significantly outperform baselines in both ranking fidelity and moral safety calibration.

📝 Abstract
Vision-Language Models (VLMs) continue to struggle to make morally salient judgments in multimodal and socially ambiguous contexts. Prior work typically relies on binary or pairwise supervision, which often fails to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated by humans, via an interface we tailored for data collection, with moral acceptability scores and grounded reasoning labels, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.
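The abstract does not specify the exact listwise objective, so as a rough illustration only: a standard choice for "listwise preference optimization over ranked scenario sets" is the Plackett-Luce (ListMLE-style) negative log-likelihood, where the target ranking is induced by the human 5-point ratings and the model's scalar scores are trained to reproduce it. The function name and the use of Plackett-Luce here are assumptions, not the paper's confirmed method.

```python
import math

def plackett_luce_nll(scores, ratings):
    """Listwise ranking loss sketch (Plackett-Luce / ListMLE style).

    scores  -- model-assigned scalar utilities, one per candidate scenario
    ratings -- human 5-point moral-acceptability ratings (the supervision)
    Returns the negative log-likelihood of the rating-induced ranking
    under a Plackett-Luce model over the scores; lower means the model's
    scores better reproduce the human ordering.
    """
    # Target ranking: candidates ordered from highest- to lowest-rated.
    order = sorted(range(len(scores)), key=lambda i: -ratings[i])
    nll = 0.0
    for step in range(len(order)):
        remaining = order[step:]
        # Numerically stable log-softmax of the chosen item's score
        # over the pool of not-yet-ranked candidates.
        m = max(scores[i] for i in remaining)
        log_z = m + math.log(sum(math.exp(scores[i] - m) for i in remaining))
        nll -= scores[order[step]] - log_z
    return nll
```

A score list that agrees with the rating order gets a lower loss than one that reverses it, which is the richer training signal the abstract contrasts with binary supervision.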
Problem

Research questions and friction points this paper is trying to address.

multimodal moral reasoning
vision-language models
moral judgment
scalar supervision
social ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

scalar judgment
listwise alignment
multimodal moral reasoning
vision-language models
modality grounding
🔎 Similar Papers
No similar papers found.