🤖 AI Summary
Conventional multimodal large language models (MLLMs) rely on external visual encoders, which introduce architectural complexity and high computational overhead. Method: This work proposes a native multimodalization approach that eliminates the need for an external vision encoder. It internalizes visual capabilities directly in the LLM's parameter space by injecting vision-specific LoRA adapters, applying block-wise knowledge distillation from a pre-trained ViT, and using bidirectional attention masks over image tokens, enabling end-to-end visual understanding at arbitrary resolutions. The LoRA parameters can be fully merged into the LLM for inference, and the model adaptively handles high-resolution inputs. Contribution/Results: With additional pretraining data and no external encoder, the proposed model achieves performance comparable to that of conventional encoder-based MLLMs, while substantially reducing structural complexity and inference cost and maintaining strong multimodal reasoning capability. All code, datasets, and pretrained weights are publicly released.
📝 Abstract
We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible-length context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the contextual information of an image. We demonstrate that, with additional pre-training data, VoRA performs comparably to conventional encoder-based MLLMs. All training data, code, and model weights will be released at https://github.com/Hon-Wong/VoRA.
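To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of the two mechanisms the abstract describes: low-rank adapters added to an LLM's linear layers that can be folded back into the dense weights at inference (so no extra modules remain), and a mixed attention mask that is causal for text but bidirectional among image tokens. Names such as `LoRALinear` and `mixed_attention_mask` are illustrative only and do not come from the VoRA release; this is one plausible reading of the method, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the original LLM weights frozen
        self.scale = alpha / rank
        # A starts small, B at zero, so training begins from the unchanged base model.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the low-rank correction that carries the visual knowledge.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # Fold the update into the dense weight: W <- W + scale * (B @ A).
        # Afterwards the layer is an ordinary nn.Linear with zero extra cost.
        self.base.weight += self.scale * (self.lora_b @ self.lora_a)
        return self.base


def mixed_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask: causal for text tokens, bidirectional among image
    tokens (one reading of the bi-directional masking described in the abstract).
    `is_image` is a bool vector of length seq_len marking image-token positions."""
    n = is_image.numel()
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    image_block = is_image[:, None] & is_image[None, :]
    return causal | image_block


if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(64, 64), rank=8)
    nn.init.normal_(layer.lora_b, std=0.02)  # perturb B so the demo is non-trivial
    x = torch.randn(2, 10, 64)
    y_adapted = layer(x)           # training-time path: base + LoRA
    y_merged = layer.merge()(x)    # inference-time path: a single dense layer
    print(torch.allclose(y_adapted, y_merged, atol=1e-5))  # -> True

    mask = mixed_attention_mask(torch.tensor([True, True, True, False, False]))
    print(mask.int())  # image tokens attend to each other; text stays causal
```

The merge step is why, per the abstract, the added parameters introduce no structural complexity at inference: once `W + scale * (B @ A)` is computed, the model is again a plain LLM forward pass.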