VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unifying diverse visual perception and reasoning tasks within a single framework. Methodologically, it introduces multi-object cognitive learning strategies and a systematic task reformulation that cast diverse perception tasks into one shared paradigm, combining reinforcement learning driven policy optimization with joint vision-language modeling and multi-task training. Its core contribution is a single end-to-end model that handles ten vision tasks spanning detection, segmentation, and counting, generating a structured, human-readable reasoning process before delivering its outputs. Experimental results show significant relative improvements over Qwen2.5-VL: 29.1% on COCO (detection), 22.1% on ReasonSeg (reasoning-oriented segmentation), and 15.3% on CountBench (counting), with consistently strong results across all three domains as a unified model.

📝 Abstract
Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing novel multi-object cognitive learning strategies and systematic task reformulation, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks in a unified framework. The model generates a structured reasoning process before delivering the desired outputs in response to user queries. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).
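The "structured reasoning process before delivering the desired outputs" idea can be sketched with a small post-processing example. Assume, hypothetically, that the model emits a reasoning trace followed by a JSON list of boxes (the tag names, box format, and helper functions below are illustrative, not taken from the paper); then detection, counting, and segmentation all reduce to different reductions of one shared output format:

```python
import json
import re

def parse_unified_output(model_text: str):
    """Split a (hypothetical) model response into its reasoning trace and a
    JSON list of predicted boxes, assuming a format like
    '<think>...</think><answer>[{"bbox": [x1, y1, x2, y2]}, ...]</answer>'."""
    think = re.search(r"<think>(.*?)</think>", model_text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", model_text, re.S)
    reasoning = think.group(1).strip() if think else ""
    boxes = json.loads(answer.group(1)) if answer else []
    return reasoning, boxes

def to_task_output(boxes, task: str):
    """Reduce the shared box format to a task-specific result."""
    if task == "detection":
        return [b["bbox"] for b in boxes]            # keep the boxes as-is
    if task == "counting":
        return len(boxes)                            # only the object count
    if task == "segmentation":
        # each box would be handed to a downstream promptable mask generator
        return [{"prompt_box": b["bbox"]} for b in boxes]
    raise ValueError(f"unknown task: {task}")

reply = ('<think>Two cats sit on the sofa; I locate each one.</think>'
         '<answer>[{"bbox": [10, 20, 110, 220]}, '
         '{"bbox": [130, 25, 230, 210]}]</answer>')
reasoning, boxes = parse_unified_output(reply)
print(to_task_output(boxes, "counting"))    # 2
print(to_task_output(boxes, "detection"))   # [[10, 20, 110, 220], [130, 25, 230, 210]]
```

This is only a sketch of why a shared localization format lets one model serve several perception tasks; the paper's actual output schema may differ.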
Problem

Research questions and friction points this paper is trying to address.

Unify visual perception and reasoning in one model
Enhance reasoning for diverse visual perception tasks
Improve performance across detection, segmentation, and counting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for visual perception and reasoning
Multi-object cognitive learning strategies
Systematic task reformulation for diverse tasks
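To make the multi-object cognitive learning idea concrete, a reinforcement-learning reward for multi-object outputs might combine a format term with a matching-based localization term. The function below is a toy illustration under that assumption (greedy IoU matching, equal weights); it is not the paper's actual reward rule:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def multi_object_reward(pred_boxes, gt_boxes, format_ok: bool):
    """Toy reward: a format term plus a greedy one-to-one IoU match over all
    predicted objects, normalized so extra or missing boxes are penalized.
    Illustrative only; weights and matching are assumptions."""
    fmt = 1.0 if format_ok else 0.0
    remaining = list(gt_boxes)
    total = 0.0
    for p in pred_boxes:
        if not remaining:
            break
        best = max(remaining, key=lambda g: iou(p, g))
        total += iou(p, best)
        remaining.remove(best)
    denom = max(len(gt_boxes), len(pred_boxes), 1)
    return 0.5 * fmt + 0.5 * (total / denom)
```

With this shape of reward, a well-formatted answer whose boxes exactly match the ground truth scores 1.0, while hallucinated or missing objects dilute the localization term.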