🤖 AI Summary
Existing multimodal large language models exhibit limitations in visual persuasiveness assessment and trustworthy reasoning, compounded by a lack of effective training and evaluation protocols. This work proposes a supervised fine-tuning approach leveraging diverse teacher-generated rationales to enhance the model's capability in predicting visual persuasiveness. Furthermore, it introduces the first three-dimensional evaluation framework for visual persuasive reasoning, assessing rationale quality along the dimensions of consistency, image grounding, and sensitivity. Experimental results demonstrate that fine-tuning with diverse rationales significantly improves prediction performance. Notably, the sensitivity dimension aligns most closely with human preferences, revealing a critical misalignment between prediction accuracy and reasoning trustworthiness. This study thus establishes a scalable supervision paradigm for trustworthy multimodal reasoning.
📝 Abstract
Despite the strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet no established methodology exists for training MLLMs to reason about visual persuasion or for evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.