🤖 AI Summary
Current vision-language models lack reliable capabilities for verifying and refining visual generation outputs, leaving a substantial gap from human-level performance on visual verification tasks. To address this, we propose OmniVerifier, a generative universal verifier, and introduce ViVerBench, the first benchmark dedicated to visual verification. We train OmniVerifier-7B, a 7-billion-parameter model, and through training identify three atomic verification capabilities that generalize and interact synergistically. Furthermore, we devise OmniVerifier-TTS, a sequential test-time scaling paradigm that couples iterative fine-grained optimization with interleaved world-model reasoning. Experiments demonstrate that OmniVerifier-7B achieves an 8.3-point gain on ViVerBench, and OmniVerifier-TTS attains improvements of 3.7 and 4.3 points on T2I-ReasonBench and GenEval++, respectively, significantly outperforming parallel baselines such as Best-of-N. Our work establishes foundational methodology and evaluation infrastructure for trustworthy visual reasoning and generation verification.
📝 Abstract
We introduce the Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, raising the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
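To make the contrast with parallel methods like Best-of-N concrete, the sequential verify-and-refine idea behind OmniVerifier-TTS can be sketched as a loop: generate once, ask the verifier for a fine-grained critique, and edit the current image rather than resampling from scratch. This is a minimal illustrative sketch, not the paper's implementation; `generate`, `verify`, and `edit` are hypothetical stand-ins for the unified model's generation, the generative verifier, and instruction-guided editing.

```python
# Illustrative sketch of sequential test-time scaling (verify-then-edit),
# as opposed to parallel Best-of-N sampling. All functions are stand-ins.
from dataclasses import dataclass


@dataclass
class Verdict:
    score: float   # overall prompt-image alignment score in [0, 1]
    feedback: str  # fine-grained critique, reused as an edit instruction


def generate(prompt: str) -> str:
    """Stand-in for the unified model's image generation step."""
    return f"image({prompt})"


def verify(prompt: str, image: str) -> Verdict:
    """Stand-in generative verifier: critiques the image against the prompt."""
    aligned = prompt in image
    return Verdict(score=1.0 if aligned else 0.5,
                   feedback="" if aligned else "fix the flagged attribute")


def edit(image: str, feedback: str) -> str:
    """Stand-in for instruction-guided editing within the unified model."""
    return f"edited({image}, {feedback})"


def sequential_tts(prompt: str, max_rounds: int = 3,
                   threshold: float = 0.9) -> str:
    """Generate once, then iteratively refine guided by verifier feedback."""
    image = generate(prompt)
    for _ in range(max_rounds):
        verdict = verify(prompt, image)
        if verdict.score >= threshold:
            break  # verifier accepts the current image
        image = edit(image, verdict.feedback)
    return image
```

The key design point is that compute scales along a refinement trajectory (each round reuses the verifier's critique as an edit instruction), whereas Best-of-N spends the same budget on independent samples and keeps only the best one.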