QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP visual encoders in multimodal large language models (MLLMs) suffer from three key limitations: fixed input resolution, semantic confusion across heterogeneous images, and the need for full-model retraining upon replacement. To address these, we propose a dynamic quadtree-based image patching mechanism that enables content-adaptive visual tokenization. Our method introduces the first dynamic quadtree visual prior to eliminate mesoscopic-scale distortion and interpolation bias; supports arbitrary-resolution inputs while disentangling embeddings of dissimilar images; and achieves plug-and-play integration via hierarchical token aggregation and a lightweight CLIP replacement architecture. Deployed zero-shot on the full LLaVA-v1.5 series, our approach significantly improves VQA accuracy—achieving up to +13.6% on the V* fine-grained understanding benchmark—while preserving robust coarse- and fine-grained visual comprehension capabilities.

📝 Abstract
Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate cross-modal representation learning. The CLIP model is a widely adopted foundational vision-language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder suffers from notable limitations, including being constrained to fixed input resolutions and failing to produce separated embeddings for dissimilar images. Replacing the vision encoder of an existing model typically incurs substantial computational costs, because such a change often necessitates retraining the entire model pipeline. In this work, we identify two factors underlying the limitations of the CLIP vision encoder: mesoscopic bias and interpolation bias. To address these issues, we propose QLIP, a drop-in replacement for CLIP that can be seamlessly integrated with existing MLLMs with only a few lines of code and can enhance both coarse-grained and fine-grained visual understanding without retraining. QLIP is designed around an image quadtree, which replaces the standard uniform-grid patches with a novel content-aware patchification. Our experimental results demonstrate that QLIP improves the general visual question answering accuracy of the LLaVA-v1.5 model series across various model sizes, without requiring retraining or fine-tuning of the full MLLM. Notably, QLIP boosts detailed understanding performance on the challenging $V^*$ benchmark by up to 13.6%.
Problem

Research questions and friction points this paper is trying to address.

Overcoming CLIP's fixed input resolution limitations
Eliminating interpolation bias in vision-language models
Enhancing MLLM visual understanding without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

QLIP replaces CLIP without retraining MLLMs
Uses quadtree for content-aware image patchification
Enhances both coarse and fine-grained visual understanding
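The core idea, replacing CLIP's uniform grid with content-adaptive quadtree patches, can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a simple variance-based split criterion (the function name `quadtree_patches` and the threshold values are illustrative only): homogeneous regions are kept as single large patches, while detailed regions are recursively subdivided.

```python
import numpy as np

def quadtree_patches(img, min_size=16, var_threshold=0.01, x=0, y=0, size=None):
    """Recursively split a square image into patches: regions with high
    pixel variance (fine detail) are subdivided; homogeneous regions are
    kept as single large patches. Returns a list of (x, y, size) tuples."""
    if size is None:
        size = img.shape[0]  # assume a square image
    region = img[y:y + size, x:x + size]
    # Stop splitting at the minimum size or when the region is homogeneous.
    if size <= min_size or region.var() <= var_threshold:
        return [(x, y, size)]
    half = size // 2
    patches = []
    for dy in (0, half):
        for dx in (0, half):
            patches.extend(quadtree_patches(img, min_size, var_threshold,
                                            x + dx, y + dy, half))
    return patches

# A flat region collapses to one large patch; a noisy region is
# subdivided down to the minimum patch size (4x4 grid here).
flat = np.zeros((64, 64))
noisy = np.random.default_rng(0).random((64, 64))
print(len(quadtree_patches(flat)))   # → 1
print(len(quadtree_patches(noisy)))  # → 16
```

The resulting variable-size patches would then need hierarchical token aggregation (as the summary describes) so that the downstream MLLM still receives a fixed token interface.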