🤖 AI Summary
This work addresses the challenge of unifying diverse visual perception and reasoning tasks within a single framework. Methodologically, it introduces a multi-object cognitive learning strategy and a unified task-modeling paradigm, integrating reinforcement-learning-driven cognitive policy optimization, joint vision-language modeling, task-semantic reformulation, and multi-task collaborative training. Its core contribution is an end-to-end, interpretable reasoning model that jointly handles ten vision tasks, including detection, segmentation, and counting, while generating structured, human-readable reasoning chains across these domains. Experimental results demonstrate significant relative improvements over Qwen2.5-VL: +29.1% on COCO (detection), +22.1% on ReasonSeg (reasoning-aware segmentation), and +15.3% on CountBench (counting). To the authors' knowledge, this is the first unified model to achieve consistent state-of-the-art performance across these three major subfields of visual perception.
📝 Abstract
Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning about and solving multiple visual perception tasks within a shared model. Specifically, by designing novel multi-object cognitive learning strategies and systematic task reformulation, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks in a unified framework. The model generates a structured reasoning process before delivering the desired outputs in response to user queries. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming Qwen2.5-VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).
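Note that the reported margins are relative improvements over the baseline's score, not absolute percentage-point gains. A minimal sketch of the computation, using hypothetical scores that are not taken from the paper:

```python
def relative_margin(baseline: float, ours: float) -> float:
    """Relative improvement over a baseline score, in percent."""
    return (ours - baseline) / baseline * 100.0

# Hypothetical illustration: a baseline metric of 50.0 improved to 64.55
# corresponds to a +29.1% relative margin (not +14.55 absolute points).
print(round(relative_margin(50.0, 64.55), 1))
```

This distinction matters when comparing results across papers, since a "+29.1%" relative margin over a strong baseline can correspond to a much smaller absolute gain.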