🤖 AI Summary
Vision-Language Models (VLMs) suffer from limited robustness on real-world images with diverse resolutions and aspect ratios because they rely on fixed, low-resolution input paradigms; moreover, systematic evaluation benchmarks and open-source training frameworks for resolution robustness remain lacking. To address this, the authors propose RC-Bench, an open-source benchmark explicitly designed to evaluate resolution robustness, and NativeRes-LLaVA, an open-source framework supporting native-resolution training. Extensive evaluation on RC-Bench and multiple resolution-sensitive benchmarks demonstrates significant improvements in fine-grained visual understanding. The results empirically validate that native-resolution modeling enhances both the robustness and accuracy of VLMs across varying resolutions and aspect ratios.
📝 Abstract
Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native-resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks. Code is available at https://github.com/Niujunbo2002/NativeRes-LLaVA.
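To make the contrast concrete, here is a minimal sketch of the token-budget difference between fixed-resolution and native-resolution visual encoding. The function names, the 14-pixel patch size, and the 336×336 fixed input are illustrative assumptions (common in ViT-based encoders), not the paper's exact pipeline.

```python
import math

def native_res_tokens(height: int, width: int, patch: int = 14):
    """Native-resolution encoding (sketch): the image keeps its own
    height/width and is patchified directly, so the token grid follows
    the image's aspect ratio. Patch size 14 is an assumed ViT default."""
    grid_h = math.ceil(height / patch)
    grid_w = math.ceil(width / patch)
    return grid_h, grid_w, grid_h * grid_w

def fixed_res_tokens(size: int = 336, patch: int = 14):
    """Conventional fixed-resolution encoding: every image is resized
    to size x size, so all images yield the same square token grid
    regardless of their original shape."""
    grid = size // patch
    return grid, grid, grid * grid

# A wide 336x1344 (4:1) image keeps its shape under native encoding...
print(native_res_tokens(336, 1344))  # (24, 96, 2304)
# ...but is squashed into the same 24x24 = 576 tokens as any other image
# under fixed 336x336 encoding, losing fine horizontal detail.
print(fixed_res_tokens())            # (24, 24, 576)
```

The point of the sketch: under fixed-resolution encoding, extreme aspect ratios are distorted and all images share one token budget, whereas native-resolution encoding preserves the aspect ratio and scales the token count with the image, which is what RC-Bench is designed to stress-test.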