Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

📅 2025-02-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large multimodal models (LMMs) exhibit weak fine-grained visual reasoning and limited interpretability. Method: The paper proposes a visual rejection sampling framework that improves LMMs using self-synthesized data grounded in expert-defined semantic concepts. Interpretable answers containing human-verifiable visual features are synthesized via concept–image alignment, enabling joint optimization of answers and explanations. After each round of fine-tuning, a reward-model-free filtering mechanism selects the highest-quality synthesized answers for the next round, so data synthesis and visual fine-tuning proceed iteratively without a learned reward model. Results: On specialized visual classification tasks, the method improves both prediction accuracy and explanation plausibility, producing explanations grounded in human-verifiable visual evidence and jointly enhancing the fine-grained cognition and interpretability of LMMs.
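The concept–image alignment step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy embedding vectors stand in for a real vision-language encoder (e.g. a CLIP-style model), and the concept names are hypothetical examples of expert-defined visual features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_concepts(image_emb, concept_embs, top_k=2):
    """Rank expert-defined concepts by alignment with the image
    embedding and keep the top_k best-aligned ones."""
    ranked = sorted(concept_embs.items(),
                    key=lambda kv: cosine(image_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy embeddings; a real system would obtain these from image/text encoders.
image_emb = [0.9, 0.1, 0.2]
concept_embs = {
    "serrated leaf margin": [0.8, 0.2, 0.1],
    "smooth bark":          [0.1, 0.9, 0.3],
    "lobed leaf shape":     [0.7, 0.0, 0.4],
}
print(select_concepts(image_emb, concept_embs))
```

The selected concepts would then be woven into a synthesized, human-verifiable explanation for the image's label.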

๐Ÿ“ Abstract
Large multimodal models (LMMs) have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
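The reward-model-free filtering round described in the abstract might look like the sketch below. The criterion shown, discarding synthesized answers whose predicted label disagrees with the known class label and ranking the rest by a model-assigned score, is one plausible realization under those assumptions, not the paper's exact rule; field names and scores are illustrative.

```python
def filter_answers(candidates, true_label, keep=1):
    """Reward-model-free filtering: drop synthesized answers whose
    predicted label disagrees with the ground-truth class label, then
    keep the highest-scoring remainder for the next tuning round."""
    correct = [c for c in candidates if c["label"] == true_label]
    correct.sort(key=lambda c: c["score"], reverse=True)
    return correct[:keep]

# Toy candidates synthesized for one training image labelled "oak".
candidates = [
    {"label": "oak",   "score": 0.91, "explanation": "lobed leaves, rough bark"},
    {"label": "maple", "score": 0.88, "explanation": "pointed lobes"},
    {"label": "oak",   "score": 0.55, "explanation": "green foliage"},
]
best = filter_answers(candidates, "oak")
print(best[0]["explanation"])
```

Repeating synthesize-filter-tune rounds with the surviving answers is what progressively improves the model's explanations in the framework described above.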
Problem

Research questions and friction points this paper is trying to address.

Improve LMMs' fine-grained visual reasoning
Enhance explainability with self-synthesized data
Generate accurate, interpretable visual explanations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-synthesized data enhances cognition
Visual rejection sampling improves explainability
Iterative fine-tuning with reward-free filtering