Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

📅 2025-02-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large multimodal models (LMMs) exhibit weak fine-grained visual reasoning and limited interpretability. Method: The paper proposes a visual rejection sampling framework that improves LMMs using self-synthesized data grounded in expert-defined semantic concepts. Interpretable answers containing human-verifiable visual features are synthesized via concept–image alignment, enabling joint optimization of answers and explanations. After each round of fine-tuning, a reward-model-free filtering mechanism selects the highest-quality synthesized answers for the next round, so data synthesis and visual fine-tuning proceed iteratively without a learned reward model. Results: On specialized visual classification tasks, the method improves both prediction accuracy and explanation plausibility, producing explanations grounded in human-verifiable visual evidence and jointly enhancing the fine-grained cognition and interpretability of LMMs.
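The concept–image alignment step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy embedding vectors stand in for a real vision-language encoder (e.g. a CLIP-style model), and the concept names are hypothetical examples of expert-defined visual features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_concepts(image_emb, concept_embs, top_k=2):
    """Rank expert-defined concepts by alignment with the image
    embedding and keep the top_k best-aligned ones."""
    ranked = sorted(concept_embs.items(),
                    key=lambda kv: cosine(image_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy embeddings; a real system would obtain these from image/text encoders.
image_emb = [0.9, 0.1, 0.2]
concept_embs = {
    "serrated leaf margin": [0.8, 0.2, 0.1],
    "smooth bark":          [0.1, 0.9, 0.3],
    "lobed leaf shape":     [0.7, 0.0, 0.4],
}
print(select_concepts(image_emb, concept_embs))
```

The selected concepts would then be woven into a synthesized, human-verifiable explanation for the image's label.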

๐Ÿ“ Abstract
Large multimodal models (LMMs) have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
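The reward-model-free filtering round described in the abstract might look like the sketch below. The criterion shown, discarding synthesized answers whose predicted label disagrees with the known class label and ranking the rest by a model-assigned score, is one plausible realization under those assumptions, not the paper's exact rule; field names and scores are illustrative.

```python
def filter_answers(candidates, true_label, keep=1):
    """Reward-model-free filtering: drop synthesized answers whose
    predicted label disagrees with the ground-truth class label, then
    keep the highest-scoring remainder for the next tuning round."""
    correct = [c for c in candidates if c["label"] == true_label]
    correct.sort(key=lambda c: c["score"], reverse=True)
    return correct[:keep]

# Toy candidates synthesized for one training image labelled "oak".
candidates = [
    {"label": "oak",   "score": 0.91, "explanation": "lobed leaves, rough bark"},
    {"label": "maple", "score": 0.88, "explanation": "pointed lobes"},
    {"label": "oak",   "score": 0.55, "explanation": "green foliage"},
]
best = filter_answers(candidates, "oak")
print(best[0]["explanation"])
```

Repeating synthesize-filter-tune rounds with the surviving answers is what progressively improves the model's explanations in the framework described above.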
Problem

Research questions and friction points this paper is trying to address.

Improve LMMs' fine-grained visual reasoning
Enhance explainability with self-synthesized data
Generate accurate, interpretable visual explanations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-synthesized data enhances cognition
Visual rejection sampling improves explainability
Iterative fine-tuning with reward-free filtering