AI Summary
To address factual inconsistency in multimodal summarization, this paper proposes the first fine-grained, interpretable dual-path factuality evaluation framework, jointly supporting reference-based supervised evaluation and reference-free open-scenario assessment. Methodologically, it integrates multimodal alignment modeling, cross-modal factual verification, and explainable score decomposition to enable error localization and natural-language explanation generation. Evaluated across multiple benchmarks, the framework substantially outperforms conventional metrics (e.g., BLEU, ROUGE) and achieves a 32% improvement in correlation with human judgments. The code and dataset are publicly released, establishing a new paradigm and practical toolkit for factuality research in multimodal summarization.
Abstract
Multimodal summarization aims to generate a concise summary from input text and images. However, existing methods may produce unfactual output. To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks (FALLACIOUS) for different application scenarios: a reference-based factuality evaluation framework and a reference-free factuality evaluation framework. Notably, the reference-free framework does not require ground truth and therefore applies to a wider range of scenarios. To assess the effectiveness of the proposed frameworks, we compute the correlation between our frameworks and other metrics. The experimental results demonstrate the effectiveness of our proposed method. We will release our code and dataset via GitHub.
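The abstract evaluates the frameworks by correlating their scores with other metrics and with human judgments. As a minimal sketch of how such a meta-evaluation is typically computed (the score lists below are illustrative placeholders, not data from the paper):

```python
# Hedged sketch: meta-evaluating a factuality metric by correlating its
# per-summary scores with human factuality judgments on the same summaries.
# All numbers here are hypothetical, for illustration only.
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson computed on ranks (no tie handling)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-summary scores (one entry per generated summary).
metric_scores = [0.91, 0.40, 0.75, 0.22, 0.60]  # automatic factuality scores
human_scores = [5, 2, 4, 1, 3]                  # 1-5 human factuality ratings

print(f"Pearson  r   = {pearson(metric_scores, human_scores):.3f}")
print(f"Spearman rho = {spearman(metric_scores, human_scores):.3f}")
```

A higher correlation indicates that the automatic metric ranks summaries more consistently with human judgments, which is the standard way claims like "improved correlation with human judgments" are substantiated.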