Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

📅 2026-01-14
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work proposes a unified generative paradigm for multimodal reasoning that addresses a key limitation of existing approaches: they follow task-specific inference patterns and struggle to generalize across diverse reasoning tasks. The framework unifies multiple reasoning skills by generating intermediate images during inference. Its instantiation, Omni-R1, is trained with a two-stage strategy combining supervised fine-tuning (SFT) and reinforcement learning (RL), leveraging a perception alignment loss and a perception reward to enable functional image generation. A companion variant, Omni-R1-Zero, removes the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Experimental results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, while Omni-R1-Zero matches or even surpasses Omni-R1 on average despite using no multimodal annotations.
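The summary only names the training ingredients, so the following Python sketch is a rough, hypothetical illustration of how a two-stage SFT+RL objective with a perception alignment loss and a perception reward could be wired up. All function names, tensor fields, and weightings (`sft_loss`, `rl_reward`, `alpha`, `beta`, the `batch` layout) are assumptions for illustration, not the authors' implementation.

```python
import torch.nn.functional as F

# Hypothetical sketch of the two-stage Omni-R1 objective described above.
# Names, shapes, and weightings are illustrative assumptions.

def sft_loss(model, batch, alpha=0.5):
    """Stage 1: supervised fine-tuning on interleaved text/image traces,
    with an auxiliary perception alignment term on generated images."""
    out = model(batch.inputs)
    # Standard next-token loss over the reasoning trace:
    # logits (B, T, V) -> (B, V, T) to match targets (B, T).
    lm = F.cross_entropy(out.logits.transpose(1, 2), batch.targets)
    # Perception alignment: pull features of generated intermediate images
    # toward reference visual features (exact formulation assumed, not given).
    align = 1.0 - F.cosine_similarity(
        out.image_features, batch.ref_image_features, dim=-1
    ).mean()
    return lm + alpha * align

def rl_reward(answer_correct, perception_score, beta=0.3):
    """Stage 2: scalar reward mixing final-answer correctness with a
    perception reward scoring the usefulness of intermediate images."""
    return float(answer_correct) + beta * perception_score
```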

📝 Abstract
Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.
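To make the Omni-R1-Zero idea from the abstract concrete, here is a minimal, hypothetical sketch of a bootstrapping loop that turns text-only chain-of-thought data into interleaved text/image traces without multimodal annotations. The callables `render_step` (a model-generated visualization of a reasoning step) and `keep_if` (a filter for unhelpful images) are invented for illustration and do not appear in the paper.

```python
# Hypothetical sketch: bootstrap step-wise visualizations from
# text-only reasoning data, as Omni-R1-Zero is described to do.

def bootstrap_traces(text_cot_dataset, render_step, keep_if):
    """Turn text-only (question, steps, answer) traces into
    interleaved text/image traces with no multimodal annotation."""
    traces = []
    for question, steps, answer in text_cot_dataset:
        interleaved = []
        for step in steps:
            image = render_step(step)      # model-generated visualization
            if keep_if(step, image):       # keep only functional images
                interleaved.append((step, image))
            else:
                interleaved.append((step, None))
        traces.append((question, interleaved, answer))
    return traces
```

The resulting traces could then serve as SFT data for the two-stage recipe sketched earlier.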
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
generalizability
reasoning paradigm
multimodal tasks
diverse reasoning skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified generative reasoning
multimodal large language models
intermediate image generation
perception alignment loss
annotation-free multimodal learning