Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models

📅 2025-08-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large vision-language models (LVLMs) lack deep, chain-of-thought reasoning capabilities for complex cross-modal commonsense inference, often relying on superficial correlations rather than rigorous multi-step deduction. To address this, we propose CMRF, a Coherent Multimodal Reasoning Framework with iterative self-evaluation. CMRF employs a three-module architecture: a Reasoning Decomposition Unit, a Contextual Inference Engine, and a Coherence Assessment Module, enabling problem decomposition, progressive inference, and consistency self-checking. Built upon LLaVA-1.6-34B and trained on the newly constructed MDAR dataset, CMRF supports multi-step reasoning and confidence-aware self-assessment. On benchmarks including VCR, A-OKVQA, and DailyLife-MRC, it achieves state-of-the-art performance among open-source models, attaining an average accuracy of 69.4%, 2.4 percentage points higher than the strongest baseline, while significantly improving logical coherence and accuracy in complex multimodal reasoning tasks.

📝 Abstract
Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal commonsense reasoning tasks, often exhibiting a lack of "deliberative thinking." They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' commonsense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Our framework integrates three key modules: a Reasoning Decomposition Unit (RDU) for breaking down problems into sub-questions, a Contextual Inference Engine (CIE) for generating context-grounded inference steps, and a Coherence Assessment Module (CAM) for evaluating logical consistency and confidence. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaVA-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source LVLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. It attains an average accuracy of 69.4%, surpassing the best open-source baseline by +2.4 percentage points, with particular strength in complex reasoning scenarios. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning.
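The decompose-infer-assess-refine loop described in the abstract can be sketched in code. This is an illustrative sketch only: the function names (`decompose`, `infer_step`, `assess`, `cmrf_reason`) and the stub logic are hypothetical stand-ins, not the paper's implementation, where the RDU, CIE, and CAM are LVLM components rather than simple functions.

```python
# Hypothetical sketch of CMRF-style adaptive iterative refinement.
# All module stubs below are illustrative stand-ins, not the paper's code.

def decompose(question):
    """Stand-in for the Reasoning Decomposition Unit (RDU):
    splits a complex query into sub-questions."""
    return [f"sub-question {i} of: {question}" for i in (1, 2)]

def infer_step(sub_question, image_context):
    """Stand-in for the Contextual Inference Engine (CIE):
    produces one reasoning step grounded in the visual context."""
    return f"inference for ({sub_question}) given {image_context}"

def assess(steps):
    """Stand-in for the Coherence Assessment Module (CAM):
    returns an overall confidence score and the index of the
    least coherent step."""
    return 0.9, 0

def cmrf_reason(question, image_context, threshold=0.8, max_iters=3):
    """Adaptive Iterative Refinement: re-infer weak steps until the
    assessed confidence clears the threshold or iterations run out."""
    sub_questions = decompose(question)
    steps = [infer_step(q, image_context) for q in sub_questions]
    for _ in range(max_iters):
        confidence, weakest = assess(steps)
        if confidence >= threshold:
            break
        # Regenerate only the least coherent step, then re-assess.
        steps[weakest] = infer_step(sub_questions[weakest], image_context)
    return steps

answer_steps = cmrf_reason("Why is the person holding an umbrella?",
                           "rainy street scene")
```

The key design point the sketch captures is that self-evaluation drives refinement selectively: only the step the assessor flags as weakest is regenerated, rather than restarting the whole reasoning chain.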
Problem

Research questions and friction points this paper is trying to address.

Enhance LVLMs' common sense reasoning with iterative self-evaluation
Improve cross-modal reasoning for complex, multi-step tasks
Address superficial associations in vision-language model inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative self-evaluating inference mechanism
Three key modules for decomposition, contextual inference, and coherence assessment
Adaptive Iterative Refinement strategy
Wenjie Luo, Nanyang Technological University
Ruocheng Li, Fujian University of Technology
Shanshan Zhu, Fujian University of Technology
Julian Perry, Delta University for Science and Technology