🤖 AI Summary
This work addresses the high computational cost of visual encoding in multimodal large language models when processing high-resolution images. The authors propose an efficient, compute-controllable visual encoding scheme that replaces global encoding with slice-based local encoding and adds an early token compression mechanism in the shallow layers of the Vision Transformer (ViT). On benchmarks covering document understanding, OCR, and general visual question answering, the resulting model, LLaVA-UHD v4, reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. The study empirically shows that slice-based encoding and intra-ViT early compression can be combined without sacrificing downstream performance.
📝 Abstract
Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
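The cost argument in the abstract can be illustrated with a back-of-the-envelope FLOPs estimate. The sketch below is not the authors' implementation and does not reproduce the 55.8% figure. It assumes a standard ViT block with an MLP ratio of 4, counts one multiply-add as two FLOPs, and ignores the cost of the compression operator itself. The function names and the example settings (4096 tokens, 4 slices, width 1024, depth 24, compression at layer 4 with a keep ratio of 0.25) are hypothetical values chosen for illustration.

```python
def layer_flops(n_tokens: int, dim: int) -> int:
    """Approximate FLOPs of one transformer block (assumed MLP ratio 4)."""
    # QKV and output projections (8*N*d^2) plus the MLP (16*N*d^2).
    linear = 24 * n_tokens * dim * dim
    # Attention scores and weighted sum: quadratic in the token count.
    attention = 4 * n_tokens * n_tokens * dim
    return linear + attention


def vit_flops(n_tokens, dim, depth, compress_at=None, keep_ratio=1.0):
    """FLOPs of a ViT, optionally reducing tokens before layer `compress_at`."""
    total = 0
    n = n_tokens
    for layer in range(depth):
        if compress_at is not None and layer == compress_at:
            # Assumption: compression keeps a fixed fraction of the tokens.
            n = max(1, int(n * keep_ratio))
        total += layer_flops(n, dim)
    return total


def sliced_flops(n_tokens, n_slices, dim, depth, compress_at=None, keep_ratio=1.0):
    """FLOPs when the image is split into slices that are encoded independently."""
    per_slice = n_tokens // n_slices
    return n_slices * vit_flops(per_slice, dim, depth, compress_at, keep_ratio)


if __name__ == "__main__":
    # Hypothetical settings, not taken from the paper.
    tokens, slices, dim, depth = 4096, 4, 1024, 24
    global_cost = vit_flops(tokens, dim, depth)
    sliced_cost = sliced_flops(tokens, slices, dim, depth)
    early_cost = sliced_flops(tokens, slices, dim, depth, compress_at=4, keep_ratio=0.25)
    print(f"global encoding, post-ViT compression: {global_cost / 1e9:.1f} GFLOPs")
    print(f"slice-based encoding:                  {sliced_cost / 1e9:.1f} GFLOPs")
    print(f"slice-based + early compression:       {early_cost / 1e9:.1f} GFLOPs")
```

Under these assumptions, slicing leaves the linear terms unchanged and shrinks only the quadratic attention term, whereas early compression shrinks both terms for every layer after the compression point. Post-ViT compression corresponds to `compress_at=None`: the ViT pays the full cost before any tokens are removed.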