Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

📅 2024-03-14
🏛️ arXiv.org
📈 Citations: 16
Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) suffer from coarse-grained object localization and imprecise cross-modal referring in dense scenes due to image-resolution constraints, limiting their effectiveness in GUI agents, visual counting, and other fine-grained tasks. To address this, we propose the first general-purpose VLM supporting high-resolution input and joint image-text referring. Our method introduces a lightweight down-sampling projector that preserves pixel-level fidelity while significantly reducing computational overhead, and a novel plug-and-play visual tokenizer that unifies multimodal referring expressions, including image regions, free-form text, and coordinate-based representations, within a single framework. Through high-resolution image encoding, joint referring pretraining, and multimodal alignment fine-tuning, the model achieves state-of-the-art performance on referring expression comprehension (REC), phrase grounding, and referring expression generation (REG), and surpasses specialized expert models on object detection and visual counting benchmarks.
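The summary does not come with code, so here is a minimal sketch of the down-sampling projector idea: a strided convolution compresses a high-resolution grid of visual tokens before a linear projection maps them into the LLM embedding space. The class name, dimensions, and stride below are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class DownsampleProjector(nn.Module):
    """Illustrative down-sampling projector (assumed design, not the paper's code).

    Compresses a high-resolution grid of visual tokens with a strided
    convolution, then projects the surviving tokens into the LLM
    embedding space so the sequence fits the LLM's input-token budget.
    """

    def __init__(self, vis_dim=1024, llm_dim=4096, stride=2):
        super().__init__()
        # The strided conv merges each stride x stride block of tokens into
        # one, cutting the token count by stride**2 while keeping local context.
        self.pool = nn.Conv2d(vis_dim, vis_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, tokens, grid_size):
        # tokens: (batch, grid_size * grid_size, vis_dim) from the vision encoder
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, grid_size, grid_size)
        x = self.pool(x)                      # (b, c, grid/stride, grid/stride)
        x = x.flatten(2).transpose(1, 2)      # (b, n / stride**2, c)
        return self.proj(x)                   # (b, n / stride**2, llm_dim)

# A 64x64 token grid (4096 tokens) from a high-resolution input shrinks
# to 1024 tokens with stride 2 before entering the LLM.
projector = DownsampleProjector()
print(projector(torch.randn(1, 64 * 64, 1024), grid_size=64).shape)
# torch.Size([1, 1024, 4096])
```

The point of this kind of design is that the token count drops quadratically with the stride while each surviving token still summarizes its local neighborhood, which is what would let fine details from a high-resolution input survive without blowing past the LLM's context budget.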

📝 Abstract
Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents and counting. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector that overcomes the input-token constraint of Large Language Models. This design inherently preserves complete contexts and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize any object of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, codes, and models will be released at https://github.com/jefferyZhan/Griffon.
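As a hedged sketch of the plug-and-play visual tokenizer mentioned in the abstract: a small cross-attention resampler can compress a user-supplied target image (e.g., a cropped region of a screenshot) into a fixed handful of referring tokens in the LLM embedding space. Every name and dimension below is an assumption for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class VisualReferTokenizer(nn.Module):
    """Illustrative plug-and-play visual tokenizer (assumed design).

    Compresses a user-supplied target image into a fixed, small number of
    referring tokens in the LLM embedding space via learned queries and
    cross-attention, so a visual prompt can stand in for a textual one.
    """

    def __init__(self, patch_encoder, enc_dim=768, llm_dim=4096, num_tokens=4):
        super().__init__()
        self.patch_encoder = patch_encoder      # any image encoder -> (b, n, enc_dim)
        self.queries = nn.Parameter(torch.randn(num_tokens, llm_dim))
        self.kv_proj = nn.Linear(enc_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, target_image):
        # target_image: (batch, 3, H, W) crop of the object the user refers to
        feats = self.kv_proj(self.patch_encoder(target_image))  # (b, n, llm_dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        refer_tokens, _ = self.attn(q, feats, feats)            # (b, num_tokens, llm_dim)
        return refer_tokens

# Smoke test with a stand-in encoder that emits 49 patch features:
dummy_encoder = lambda img: torch.randn(img.size(0), 49, 768)
tokenizer = VisualReferTokenizer(dummy_encoder)
print(tokenizer(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 4, 4096])
```

Because the module only produces embeddings in the LLM's input space, it can be bolted onto an existing model, which is presumably what "plug-and-play" refers to.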
Problem

Research questions and friction points this paper is trying to address.

Overcoming image resolution limits in multimodal perception
Enabling flexible object referring with visual-textual prompts
Improving small object detection in complex scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-resolution scaling with lightweight down-sampling projector
Visual-language co-referring via plug-and-play tokenizer
Unified model for flexible object referring (prompt-assembly sketch below)
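To make the unified-referring point concrete, here is a minimal sketch, under assumed names (including the hypothetical REFER_ID placeholder), of how visual referring tokens could be spliced into an ordinary text prompt at the embedding layer. Textual and coordinate-based referring would use the same pipeline with no placeholder at all, e.g., "Locate the person at [0.21, 0.35, 0.48, 0.90]."

```python
import torch
import torch.nn as nn

REFER_ID = 32001  # hypothetical token id reserved for the visual-referring slot

def splice_refer_tokens(input_ids, embed_table, refer_tokens):
    """Replace each REFER_ID placeholder with the visual referring tokens.

    input_ids:    (seq,) prompt token ids, possibly containing REFER_ID
    embed_table:  nn.Embedding mapping ids to LLM input embeddings
    refer_tokens: (num_tokens, llm_dim) output of the visual tokenizer
    Returns the spliced (new_seq, llm_dim) embedding sequence fed to the LLM.
    """
    pieces = []
    for tid in input_ids.tolist():
        if tid == REFER_ID:
            pieces.append(refer_tokens)                          # visual prompt
        else:
            pieces.append(embed_table.weight[tid].unsqueeze(0))  # ordinary word token
    return torch.cat(pieces, dim=0)

# Usage: "Locate <refer> in the image." tokenized (hypothetically) as [5, 32001, 9, 2]
embed = nn.Embedding(32002, 4096)
refer = torch.randn(4, 4096)  # e.g. from the visual tokenizer sketched earlier
seq = splice_refer_tokens(torch.tensor([5, 32001, 9, 2]), embed, refer)
print(seq.shape)  # torch.Size([7, 4096]) -- 3 word tokens + 4 referring tokens
```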
Yufei Zhan
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Large Multimodal Models · Grounding and Detection
Yousong Zhu
Associate Professor, Institute of Automation, Chinese Academy of Sciences
Multimodal Large Language Models · Self-supervised Learning · Object Detection
Hongyin Zhao
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Fan Yang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China