🤖 AI Summary
Multimodal large language models (MLLMs) excel at high-level visual understanding but suffer from significant limitations in fine-grained perception, particularly object localization: Qwen2-VL, for example, achieves only a 43.9% recall on COCO. To address this, we propose ChatRex, an MLLM built on a **retrieval-based detection paradigm** in which the LLM outputs candidate bounding-box indices rather than raw coordinates, thereby decoupling perception from understanding. A Universal Proposal Network (UPN) supplies the candidate boxes, a two-stage training strategy jointly optimizes both capabilities, and a fully automated, multi-granularity data engine produces the Rexverse-2M instruction dataset. Evaluated on COCO, ChatRex substantially improves recall while preserving strong multimodal understanding, enabling unified modeling of localization and reasoning and demonstrating the value of perception-understanding co-design in MLLMs.
📝 Abstract
Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities; e.g., the state-of-the-art model Qwen2-VL achieves only a 43.9% recall rate on the COCO dataset, limiting many tasks that require the combination of perception and understanding. In this work, we aim to bridge this perception gap from both the model design and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM and let it output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which possesses multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities simultaneously unlocks many attractive applications, demonstrating the complementary roles of perception and understanding in MLLMs. Code is available at https://github.com/IDEA-Research/ChatRex.
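The retrieval-based detection idea described above can be sketched in a few lines: a proposal network produces candidate boxes, the LLM is prompted to answer with box *indices*, and those indices are mapped back to coordinates. The index-token format `<objK>`, the proposal list, and the parsing logic below are illustrative assumptions for this sketch, not ChatRex's actual tokenization:

```python
import re

def parse_box_indices(llm_output: str, proposals: list[tuple]) -> list[tuple]:
    """Map index tokens like '<obj1>' in the LLM's answer back to
    proposal-box coordinates (retrieval, not coordinate regression)."""
    indices = [int(i) for i in re.findall(r"<obj(\d+)>", llm_output)]
    # Keep only indices that refer to an existing proposal box.
    return [proposals[i] for i in indices if 0 <= i < len(proposals)]

# Candidate boxes from a (hypothetical) universal proposal network: (x1, y1, x2, y2)
proposals = [(10, 10, 50, 50), (60, 20, 120, 90), (5, 80, 40, 130)]

# Simulated LLM answer to "detect the dog": it retrieves an index, never coordinates.
answer = "The dog is at <obj1>."
print(parse_box_indices(answer, proposals))  # → [(60, 20, 120, 90)]
```

Because the LLM only has to select from an enumerated candidate set, localization becomes a token-classification-style task rather than continuous-value regression, which autoregressive LLMs handle more reliably.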