Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal large language models (MLLMs) struggle both to assess visual persuasiveness and to produce trustworthy rationales for their judgments, and no established protocol exists for training or evaluating such reasoning. This work shows that supervised fine-tuning on diverse teacher-generated rationales improves visual persuasiveness prediction. It also introduces a three-dimensional faithfulness evaluation framework that assesses rationales on rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying the framework shows that strong prediction performance alone does not guarantee faithful rationales, and that sensitivity is the dimension most closely aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation.

📝 Abstract

Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.
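The abstract does not define how rationale-to-decision sensitivity is computed. As a rough illustration only, one plausible reading is a flip rate: the fraction of examples whose predicted label changes when the rationale is replaced by a perturbed version. The sketch below assumes that reading. The names `sensitivity_flip_rate` and `toy_predict` are hypothetical, and `toy_predict` is a keyword-based stand-in for an MLLM call, not the authors' model or metric.

```python
from typing import Callable, Sequence


def sensitivity_flip_rate(
    predict: Callable[[str, str], int],
    image_ids: Sequence[str],
    rationales: Sequence[str],
    perturbed_rationales: Sequence[str],
) -> float:
    """Fraction of examples whose decision changes when the rationale is perturbed.

    Assumption: this flip-rate definition is illustrative and is not taken
    from the paper.
    """
    if not (len(image_ids) == len(rationales) == len(perturbed_rationales)):
        raise ValueError("all inputs must have the same length")
    if not image_ids:
        raise ValueError("need at least one example")

    flips = 0
    for image_id, original, perturbed in zip(
        image_ids, rationales, perturbed_rationales
    ):
        # A faithful rationale should drive the decision, so changing it
        # should change the prediction.
        if predict(image_id, original) != predict(image_id, perturbed):
            flips += 1
    return flips / len(image_ids)


def toy_predict(image_id: str, rationale: str) -> int:
    """Hypothetical stand-in for a model: 1 = persuasive, 0 = not persuasive.

    It ignores the image and reads the label off the rationale text.
    """
    if "not persuasive" in rationale:
        return 0
    return int("persuasive" in rationale)


images = ["img_a", "img_b"]
originals = [
    "bold contrast makes it persuasive",
    "cluttered layout, not persuasive",
]
perturbed = [
    "dull contrast, not persuasive",
    "cluttered layout, still not persuasive",
]
rate = sensitivity_flip_rate(toy_predict, images, originals, perturbed)
```

Under this reading, a higher flip rate means the decision depends more strongly on the rationale's content. A real evaluation would replace `toy_predict` with a call to the model under test and would need a principled way to construct the perturbed rationales.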

Problem

Research questions and friction points this paper is trying to address.

visual persuasion

multimodal large language models

reasoning faithfulness

rationale evaluation

persuasiveness prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual persuasion

multimodal reasoning

faithfulness evaluation