🤖 AI Summary
Vision-Language Models (VLMs) suffer from limited robustness on real-world images with diverse resolutions and aspect ratios because they rely on fixed, low-resolution input paradigms; moreover, systematic evaluation benchmarks and open-source training frameworks for resolution robustness remain lacking. To address this, the authors propose RC-Bench, an open-source benchmark explicitly designed to evaluate resolution robustness, and NativeRes-LLaVA, an open-source framework supporting native-resolution training. Extensive evaluation on RC-Bench and multiple resolution-sensitive benchmarks demonstrates significant improvements in fine-grained visual understanding. The results empirically validate that native-resolution modeling enhances both the robustness and accuracy of VLMs across varying resolutions and aspect ratios.
📝 Abstract
Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native-resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks. Code is available at https://github.com/Niujunbo2002/NativeRes-LLaVA.
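To make the contrast concrete, here is a minimal sketch of the token-budget difference between fixed-resolution and native-resolution visual encoding. The function names, the 14-pixel patch size, and the 336×336 fixed input are illustrative assumptions (common in ViT-based encoders), not the paper's exact pipeline.

```python
import math

def native_res_tokens(height: int, width: int, patch: int = 14):
    """Native-resolution encoding (sketch): the image keeps its own
    height/width and is patchified directly, so the token grid follows
    the image's aspect ratio. Patch size 14 is an assumed ViT default."""
    grid_h = math.ceil(height / patch)
    grid_w = math.ceil(width / patch)
    return grid_h, grid_w, grid_h * grid_w

def fixed_res_tokens(size: int = 336, patch: int = 14):
    """Conventional fixed-resolution encoding: every image is resized
    to size x size, so all images yield the same square token grid
    regardless of their original shape."""
    grid = size // patch
    return grid, grid, grid * grid

# A wide 336x1344 (4:1) image keeps its shape under native encoding...
print(native_res_tokens(336, 1344))  # (24, 96, 2304)
# ...but is squashed into the same 24x24 = 576 tokens as any other image
# under fixed 336x336 encoding, losing fine horizontal detail.
print(fixed_res_tokens())            # (24, 24, 576)
```

The point of the sketch: under fixed-resolution encoding, extreme aspect ratios are distorted and all images share one token budget, whereas native-resolution encoding preserves the aspect ratio and scales the token count with the image, which is what RC-Bench is designed to stress-test.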