R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models show significant limitations in multimodal reasoning, and the field lacks comprehensive benchmarks that span educational stages and support fine-grained evaluation. To address this, we propose an end-to-end multimodal reasoning framework built around a novel cross-modal formal reasoning pipeline: images are parsed into structured textual representations, enabling deep, synergistic vision-language reasoning. We introduce R1-Onevision-Bench, the first education-tiered multimodal reasoning benchmark, spanning junior high school to university and beyond, accompanied by a finely annotated dataset. Our method integrates cross-modal formal representation learning, supervised fine-tuning, and reinforcement learning, coupled with a multi-stage, education-aligned evaluation protocol. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple multimodal reasoning benchmarks, significantly outperforming GPT-4o and Qwen2.5-VL.

📝 Abstract
Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Addresses the challenge of multimodal reasoning that integrates visual and textual information.
Introduces a model that bridges visual perception and deep reasoning.
Develops a benchmark for evaluating multimodal reasoning across educational stages.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A cross-modal reasoning pipeline that transforms images into formal textual representations (see the sketch below)
The R1-Onevision dataset with detailed, step-by-step reasoning annotations
The R1-Onevision-Bench benchmark for education-staged multimodal evaluation
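The two-stage structure named in the first bullet, formalize the image into text, then reason in language, can be illustrated with a short sketch. The Python below is a hypothetical illustration only: the `vision_model` and `llm` objects, their method names (`describe`, `extract_structure`, `generate`), and the prompt format are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of a cross-modal
# formalization pipeline: stage 1 parses the image into structured text,
# stage 2 performs step-by-step language reasoning over that text.
from dataclasses import dataclass


@dataclass
class FormalizedImage:
    caption: str          # natural-language description of the scene
    structured_text: str  # e.g. extracted formulas, table contents, diagram relations


def formalize_image(image_bytes: bytes, vision_model) -> FormalizedImage:
    """Stage 1: turn pixels into a formal textual representation.

    `vision_model` is a placeholder for any captioning/OCR-capable model;
    `describe` and `extract_structure` are hypothetical method names.
    """
    caption = vision_model.describe(image_bytes)
    structured_text = vision_model.extract_structure(image_bytes)
    return FormalizedImage(caption=caption, structured_text=structured_text)


def reason_over_image(question: str, formal: FormalizedImage, llm) -> str:
    """Stage 2: language-based reasoning over the formalized content."""
    prompt = (
        f"Image description:\n{formal.caption}\n\n"
        f"Structured content:\n{formal.structured_text}\n\n"
        f"Question: {question}\n"
        "Think step by step, then state the final answer."
    )
    return llm.generate(prompt)  # `generate` is likewise a placeholder API
```

Keeping the formalized text explicit is what allows the second stage to apply ordinary chain-of-thought language reasoning, which mirrors the paper's stated motivation for cross-modal formalization.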
👥 Authors
Yi Yang (Zhejiang University)
Xiaoxuan He (Zhejiang University)
Hongkun Pan (Zhejiang University)
Xiyan Jiang (Zhejiang University)
Yan Deng (Zhejiang University)
Xingtao Yang (Zhejiang University)
Haoyu Lu (Renmin University of China)
Dacheng Yin (University of Science and Technology of China)
Fengyun Rao (WeChat Vision, Tencent Inc.)
Minfeng Zhu (Zhejiang University)
Bo Zhang (Zhejiang University)
Wei Chen (Zhejiang University)