ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

📅 2024-11-27
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) excel at high-level visual understanding but show significant limitations in fine-grained perception, particularly object localization; e.g., Qwen2-VL achieves only a 43.9% recall on COCO. To address this, the authors propose ChatRex, an MLLM with a **decoupled perception design** that recasts detection as a retrieval task: instead of regressing raw coordinates, the LLM outputs the indices of candidate bounding boxes produced by a Universal Proposal Network (UPN). On the data side, a fully automated engine constructs Rexverse-2M, a multi-granularity dataset supporting joint training of perception and understanding. After standard two-stage training, ChatRex substantially improves recall on COCO while preserving strong multimodal understanding, enabling unified modeling of localization and reasoning in MLLMs.
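The retrieval-based detection idea above can be sketched minimally: a proposal network supplies candidate boxes, the LLM emits box *indices*, and a trivial lookup maps those indices back to coordinates. The function and variable names below are illustrative assumptions, not the paper's actual API.

```python
def detect_by_retrieval(proposal_boxes, predicted_indices):
    """Map LLM-predicted box indices back to proposal coordinates.

    proposal_boxes: candidate boxes from a proposal network, (x1, y1, x2, y2).
    predicted_indices: integer indices emitted by the LLM; out-of-range
    indices are ignored rather than regressed or corrected.
    """
    return [proposal_boxes[i] for i in predicted_indices
            if 0 <= i < len(proposal_boxes)]

# Hypothetical example: four candidate boxes from the proposal network.
proposals = [(10, 10, 50, 50), (60, 20, 120, 90),
             (5, 70, 40, 110), (80, 80, 150, 150)]
# Suppose the LLM, asked to ground an object, selects boxes 1 and 3.
print(detect_by_retrieval(proposals, [1, 3]))
# → [(60, 20, 120, 90), (80, 80, 150, 150)]
```

The point of this decoupling is that choosing among a finite set of indices is a classification-style task that LLMs handle more reliably than coordinate regression.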

📝 Abstract
Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities; e.g., the state-of-the-art model Qwen2-VL only achieves a 43.9 recall rate on the COCO dataset, limiting many tasks that require the combination of perception and understanding. In this work, we aim to bridge this perception gap from both model design and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM, allowing it to output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which possesses multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities simultaneously unlocks many attractive applications, demonstrating the complementary roles of both perception and understanding in MLLMs. Code is available at https://github.com/IDEA-Research/ChatRex.

Problem

Research questions and friction points this paper is trying to address.

Bridges perception gap in multimodal LLMs
Enhances joint perception and understanding capabilities
Develops ChatRex with decoupled perception design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled perception design in MLLM
Automated data engine for dataset creation
Two-stage training for joint perception and understanding