Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd

πŸ“… 2026-02-02
πŸ›οΈ Conference on Empirical Methods in Natural Language Processing
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge that multimodal commonsense reasoning models struggle with scenarios that deviate visually or contextually from typical expectations. To this end, the authors introduce MUN, a benchmark that systematically evaluates models' counterintuitive reasoning in both anomalous and mundane settings. They propose a retrieval-based in-context learning framework (R-ICL) equipped with a new Multimodal Ensemble Retriever (MER), which retrieves relevant exemplars even when an image and its paired text are deliberately discordant, thereby transferring reasoning capabilities from large models to small ones without additional training. Experiments show an average gain of 8.3% over baseline in-context learning methods on MUN, substantially improving model robustness and adaptability in low-frequency, atypical scenarios.

πŸ“ Abstract
Commonsense reasoning in multimodal contexts remains a foundational challenge in artificial intelligence. We introduce Multimodal UNcommonsense (MUN), a benchmark designed to evaluate models' ability to handle scenarios that deviate from typical visual or contextual expectations. MUN pairs visual scenes with surprising or unlikely outcomes described in natural language, prompting models to either rationalize seemingly odd images using everyday logic or uncover unexpected interpretations in ordinary scenes. To support this task, we propose a retrieval-based in-context learning (R-ICL) framework that transfers reasoning capabilities from larger models to smaller ones without additional training. Leveraging a novel Multimodal Ensemble Retriever (MER), our method identifies semantically relevant exemplars even when image and text pairs are deliberately discordant. Experiments show an average improvement of 8.3% over baseline ICL methods, highlighting the effectiveness of R-ICL in low-frequency, atypical settings. MUN opens new directions for evaluating and improving visual-language models' robustness and adaptability in real-world, culturally diverse, and non-prototypical scenarios.
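
To make the retrieval step concrete, here is a minimal sketch of what an ensemble retriever in the spirit of MER might look like. The paper summary above does not specify implementation details, so everything here is an assumption: CLIP-style encoders producing comparable image and text embeddings, a simple weighted fusion of the two similarity channels, and the hypothetical names `mer_retrieve` and `alpha`.

```python
# Hypothetical sketch of a Multimodal Ensemble Retriever for R-ICL.
# Assumes CLIP-style encoders and weighted score fusion; the paper's
# actual MER may combine channels differently.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Row-normalize embeddings so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mer_retrieve(query_img_emb: np.ndarray,
                 query_txt_emb: np.ndarray,
                 pool_img_embs: np.ndarray,  # (N, d) image embeddings of the exemplar pool
                 pool_txt_embs: np.ndarray,  # (N, d) text embeddings of the exemplar pool
                 k: int = 4,
                 alpha: float = 0.5) -> list[int]:
    """Return indices of the top-k exemplars under an ensemble of
    image-image and text-text cosine similarities.

    alpha weights the image channel; (1 - alpha) weights the text channel.
    Ensembling both channels keeps retrieval informative even when an
    example's image and caption are deliberately discordant, as in MUN.
    """
    q_img = l2_normalize(query_img_emb)
    q_txt = l2_normalize(query_txt_emb)
    img_sim = l2_normalize(pool_img_embs) @ q_img  # (N,)
    txt_sim = l2_normalize(pool_txt_embs) @ q_txt  # (N,)
    score = alpha * img_sim + (1.0 - alpha) * txt_sim
    return np.argsort(-score)[:k].tolist()
```

Under this reading, the retrieved exemplars (image, outcome, rationale) would be prepended to the small model's prompt as in-context demonstrations, which is how the framework transfers reasoning ability without any parameter updates.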
Problem

Research questions and friction points this paper is trying to address.

multimodal commonsense reasoning
visual-language models
atypical scenarios
contextual expectations
non-prototypical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal UNcommonsense
retrieval-based in-context learning
Multimodal Ensemble Retriever
commonsense reasoning
atypical scenario evaluation
πŸ”Ž Similar Papers
No similar papers found.