🤖 AI Summary
This study addresses the gap in understanding how multimodal large language models (MLLMs) reason about social norms in complex, real-world scenarios that integrate both textual and visual information. While existing normative reasoning approaches predominantly rely on symbolic logic and struggle with multimodal social contexts, the norm comprehension capabilities of MLLMs remain underexplored. The authors present the first systematic evaluation of five leading MLLMs—including GPT-4o and Qwen-2.5VL—on a dataset of 30 textual and 30 visual social stories, benchmarking their performance against human judgments. Results reveal that all models perform significantly better on textual than visual norm inference, with GPT-4o achieving the strongest overall results and Qwen-2.5VL emerging as the top-performing open-source model. Nevertheless, substantial limitations persist across all models when handling nuanced or complex social norms, highlighting both the promise and the challenges of current MLLMs in multimodal normative reasoning.
📝 Abstract
In Multi-Agent Systems (MAS), agents are designed with social capabilities, allowing them to understand and reason about social concepts such as norms when interacting with others (e.g., inter-robot interactions). In Normative MAS (NorMAS), researchers study how norms develop and how violations are detected and sanctioned. However, existing research in NorMAS uses symbolic approaches (e.g., formal logic) for norm representation and reasoning, whose application is limited to simplified environments. In contrast, Multimodal Large Language Models (MLLMs) offer promising possibilities for developing software that robots can use to identify and reason about norms in a wide variety of complex social situations embodied in text and images. However, prior work on norm reasoning has been limited to text-based scenarios. This paper investigates the norm reasoning competence of five MLLMs by evaluating their ability to answer norm-related questions based on thirty text-based and thirty image-based stories, and comparing their responses against human judgments. Our results show that MLLMs perform better at norm reasoning over text than over images. GPT-4o performs best in both modalities, offering the most promise for integration with MAS, followed by the freely available Qwen-2.5VL. Additionally, all models find reasoning about complex norms challenging.