Scaling Vision Pre-Training to 4K Resolution

📅 2025-03-25

🤖 AI Summary

High-resolution visual pre-training is prohibitively expensive because the cost of processing an image grows quadratically with its size, which makes it hard to combine fine-grained detail perception with efficiency. Method: This paper proposes PS3, a CLIP-style vision pre-training framework that scales to 4K resolution at near-constant training cost. Instead of contrasting global image representations with text, PS3 contrasts selectively processed local regions with detailed local captions, combining saliency-guided region sampling, fine-grained contrastive learning, and a multi-scale vision encoder. The pre-trained model encodes the whole image at low resolution and selectively processes high-resolution regions chosen by saliency or by relevance to a text prompt, so resolution can be scaled up at no extra cost and test-time compute can be traded for accuracy. Contribution/Results: Applied to a multi-modal LLM, PS3 yields VILA-HD, which uses up to 4.3× fewer visual tokens than the AnyRes and S² baselines while improving high-resolution perception. On the newly proposed 4KPro benchmark, VILA-HD outperforms GPT-4o by 14.5% and Qwen2-VL by 3.2%, with a 2.96× speedup over the latter.

📝 Abstract

High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3, which scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representations, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When PS3 is applied to a multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training, such as AnyRes and S^2, while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than the latest token-pruning approaches. Finally, we find that current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x speedup over Qwen2-VL.
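The cost argument in the abstract can be made concrete with a small sketch. It is not taken from the paper: the patch size of 14 pixels, the 3780-pixel side used as a stand-in for "4K" (chosen only because it divides evenly by that patch size), and the helper names `num_patches` and `select_top_k` are all illustrative assumptions. The sketch counts patch tokens at two resolutions, then shows how keeping only a fixed budget of the most salient patches holds the processed token count constant regardless of input size.

```python
def num_patches(side_px: int, patch_px: int = 14) -> int:
    """Count non-overlapping square patches in a side_px x side_px image.

    patch_px=14 is an assumed, ViT-like patch size, not a value from the paper.
    """
    per_side = side_px // patch_px
    return per_side * per_side


def select_top_k(saliency: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring patches, in index order.

    `saliency` is a hypothetical per-patch score; how scores are produced
    (bottom-up saliency or relevance to a text prompt) is out of scope here.
    """
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])


low_res_tokens = num_patches(378)     # 27 x 27 patches
high_res_tokens = num_patches(3780)   # 270 x 270 patches (assumed "4K" stand-in)

# Self-attention compares every token with every other token, so its cost
# grows with the square of the token count.
token_ratio = high_res_tokens // low_res_tokens
attention_ratio = token_ratio ** 2

# With a fixed budget, the number of patches actually processed does not
# depend on how many patches the full image contains.
budget = 2
kept = select_top_k([0.1, 0.9, 0.3, 0.7], budget)
```

Under these assumptions, a 378-pixel image yields 729 tokens and the 3780-pixel image yields 72,900, a 100-fold increase in tokens and a 10,000-fold increase in pairwise attention comparisons. Selecting a fixed number of patches sidesteps that growth, which is the intuition behind the near-constant cost described above.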

Problem

Research questions and friction points this paper is trying to address.

Scaling vision pre-training to 4K resolution efficiently

Reducing computational cost for high-resolution image processing

Improving multi-modal LLM performance with high-resolution perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales CLIP-style pre-training to 4K resolution

Selectively processes local regions for efficiency

Enables high-resolution perception with fewer tokens
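To illustrate what contrasting local regions with local captions might look like, here is a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) loss applied to region and caption embeddings. It is not the paper's exact objective: the function name `local_contrastive_loss`, the temperature of 0.07, and the assumption that row i of each input forms a matching pair are all illustrative choices.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def local_contrastive_loss(region_embs, caption_embs, temperature=0.07):
    """Symmetric InfoNCE over (region, caption) pairs.

    Assumes region_embs[i] and caption_embs[i] describe the same local region.
    The temperature value is an assumed default, not one from the paper.
    """
    regions = [_normalize(v) for v in region_embs]
    captions = [_normalize(v) for v in caption_embs]
    logits = [
        [sum(a * b for a, b in zip(r, c)) / temperature for c in captions]
        for r in regions
    ]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-region direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


matched = local_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = local_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each region embedding is closest to its own caption and high when the pairing is scrambled, so minimizing it pulls matching region and caption embeddings together. Because the inputs are embeddings of selected regions rather than of the full image, the size of this loss computation depends on the number of sampled regions, not on the image resolution.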
