AI Summary
To address the excessive computational overhead in multimodal large language models (MLLMs) caused by repeated invocation of large vision encoders (e.g., ViT) during high-resolution image feature fusion, this paper proposes a lightweight feature enrichment method. The core idea is to treat feature upsampling as a way of generating high-resolution features, so that a shallow neural network can enrich the features without modifying or retraining the vision encoder. The method is compatible with standard ViT architectures and their outputs. Evaluated on multiple fine-grained visual understanding benchmarks, it achieves competitive accuracy while reducing FLOPs by up to 1.5× and substantially shortening both training and inference time, giving an effective trade-off between accuracy and computational efficiency.
Abstract
The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational cost due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with up to a 1.5× saving in FLOPs.
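To make the idea of feature upsampling concrete, the following is a minimal sketch, not the paper's architecture, which the abstract does not specify. It assumes a low-resolution grid of patch features from a single encoder call, upsamples it by nearest-neighbour repetition, and passes each token through one residual linear layer standing in for the "shallow feature enricher". The function names `upsample_nearest` and `enrich`, the grid sizes, and the random untrained weights are all hypothetical and chosen only for illustration.

```python
import random


def upsample_nearest(grid, factor=2):
    """Repeat each patch feature `factor` times along both spatial axes."""
    out = []
    for row in grid:
        # Widen the row: each feature vector appears `factor` times in a row.
        wide = [list(feat) for feat in row for _ in range(factor)]
        # Stack `factor` copies of the widened row.
        for _ in range(factor):
            out.append([list(feat) for feat in wide])
    return out


def enrich(grid, weight, bias):
    """Hypothetical shallow enricher: y = x + ReLU(W x + b), applied per token."""
    def token(x):
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(w_row, x)) + b)
            for w_row, b in zip(weight, bias)
        ]
        return [xi + hi for xi, hi in zip(x, hidden)]

    return [[token(x) for x in row] for row in grid]


random.seed(0)
H, W, C = 2, 3, 4  # assumed toy sizes: a 2x3 grid of 4-dimensional features

# Stand-in for the output of one vision-encoder call (random, not real features).
low_res = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]

# Untrained illustrative parameters of the enricher.
weight = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(C)]
bias = [0.0] * C

upsampled = upsample_nearest(low_res, factor=2)
high_res = enrich(upsampled, weight, bias)
print(len(high_res), len(high_res[0]), len(high_res[0][0]))  # 4 6 4
```

In this sketch the encoder is called once and the finer grid comes from a cheap per-token operation. That is the general intuition behind replacing repeated encoder calls with a shallow network, though a trained enricher would learn its weights rather than use random ones.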