🤖 AI Summary
Existing masked diffusion vision-language models (MDVLMs) lack self-correction capabilities, rendering them unable to rectify grammatical, orthographic, or logical errors in generated sequences. To address this, we propose a recursive introspection mechanism—the first to enable dynamic error identification and iterative correction *during* MDVLM generation. Our approach comprises two core components: (i) an introspective training paradigm that jointly optimizes the primary generator and a differentiable error detector; and (ii) a cyclic demasking inference framework that implements a closed-loop correction cycle—“generate → diagnose → remask → regenerate”. This breaks away from conventional unidirectional generation. Extensive experiments on multimodal understanding benchmarks—including VQAv2, TextVQA, and ST-VQA—demonstrate substantial improvements over state-of-the-art MDVLMs, validating the effectiveness, robustness, and cross-task generalizability of our self-correcting mechanism.
📝 Abstract
Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models are unable to correct errors in generated tokens, meaning they lack self-correction capability. In this paper, we propose the Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The first is Introspection Training, where an Introspection Model is introduced to identify errors within generated sequences. Introspection Training enables the model to detect not only grammatical and spelling mistakes but, more importantly, logical errors. The second is Recursive Inference. Beginning with the standard unmasking step, the learned Introspection Model identifies errors in the output sequence and remasks them. This alternating ($\text{unmask} \rightarrow \text{introspection} \rightarrow \text{remask}$) process is repeated recursively until reliable results are obtained. Experimental results on multiple benchmarks demonstrate that the proposed RIV achieves state-of-the-art performance, outperforming most existing MDVLMs.
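The alternating unmask → introspection → remask cycle described above can be sketched as a simple control loop. This is a minimal illustrative sketch only: the token representation and both helper functions (`unmask_step`, `introspect`) are hypothetical stand-ins, not the paper's actual models or API.

```python
# Illustrative sketch of RIV-style recursive inference.
# `unmask_step` and `introspect` are toy stand-ins for the generator
# and the Introspection Model described in the abstract.
MASK = "<mask>"

def unmask_step(tokens):
    # Stand-in generator: fill every masked position with a candidate token.
    return [tok if tok != MASK else "candidate" for tok in tokens]

def introspect(tokens):
    # Stand-in Introspection Model: return indices judged erroneous.
    # This toy version flags nothing, so the loop stops after one pass.
    return []

def recursive_inference(tokens, max_rounds=4):
    for _ in range(max_rounds):
        tokens = unmask_step(tokens)   # unmask: fill masked positions
        bad = introspect(tokens)       # introspection: locate errors
        if not bad:                    # reliable result -> stop
            break
        for i in bad:                  # remask flagged tokens for regeneration
            tokens[i] = MASK
    return tokens

print(recursive_inference([MASK, "cat", MASK]))
```

In a real MDVLM, `introspect` would be the jointly trained error detector and `unmask_step` the diffusion demasking step; the loop structure itself is the point of the sketch.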