OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current general-purpose foundation models lack reliable, fine-grained multimodal verification mechanisms, which hinders their safe and controllable deployment. This work proposes OmniVerifier-M1, a multimodal meta-verification framework that uses symbolic verification evidence (such as bounding boxes) in place of purely textual explanations. The framework trains a multimodal large language model with rule-based rewards and a decoupled reinforcement learning strategy that optimizes the binary judgment and meta-verification objectives separately. According to the authors, this improves verification accuracy and interpretability and enables region-level error localization. It also underpins M1-TTS, a verifier-driven agentic generation system that performs dynamic region-level self-correction, with the aim of making multimodal models more reliable and controllable.

📝 Abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
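The abstract does not spell out the reward functions, so the following is only a minimal sketch of what rule-based, decoupled rewards could look like when the rationale is a set of bounding boxes. The function names, the IoU threshold of 0.5, the greedy matching, and the scoring rule are all assumptions made for illustration, not the paper's method.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def judgment_reward(pred_ok: bool, gold_ok: bool) -> float:
    """Reward for the binary verdict only: 1 if it matches the label."""
    return 1.0 if pred_ok == gold_ok else 0.0


def localization_reward(pred: List[Box], gold: List[Box],
                        thresh: float = 0.5) -> float:
    """Reward for the symbolic rationale only (assumed scoring rule).

    Each gold box is greedily matched to the unused predicted box with
    the highest IoU; a match counts if IoU >= thresh. Dividing by the
    larger of the two counts penalizes both misses and extra boxes.
    """
    if not gold:
        return 1.0 if not pred else 0.0
    unused = list(pred)
    hits = 0
    for g in gold:
        best = max(unused, key=lambda p: iou(p, g), default=None)
        if best is not None and iou(best, g) >= thresh:
            hits += 1
            unused.remove(best)
    return hits / max(len(gold), len(pred))
```

Both rewards are computed by fixed rules, so no auxiliary judge model is needed. Keeping them as two separate functions mirrors the idea of decoupling: each signal could drive its own optimization objective rather than being summed into one joint reward. How the paper actually schedules or weights the two objectives is not stated in the abstract.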

Problem

Research questions and friction points this paper is trying to address.

multimodal verification

visual outcomes

meta-verification

fine-grained error localization

foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal meta-verification

symbolic rationales

decoupled reinforcement learning

error localization

agentic self-correction
