🤖 AI Summary
Existing LoRA fine-tuning of vision-language models (VLMs) assumes a single fixed rank for every adapted module, limiting adaptability across diverse multimodal tasks and sacrificing efficiency. To address this, we propose LangVision-LoRA-NAS, a framework that introduces neural architecture search (NAS) for automated, task-aware rank configuration in LoRA adaptation. It optimizes per-module rank assignments over a unified vision-language backbone, integrating a Vision Transformer image encoder with a large language model (e.g., LLaMA-3.2-11B), to enable adaptive, variable-rank structures for both multimodal understanding and generation. Experiments demonstrate that the searched configurations outperform fixed-rank LoRA on downstream tasks while preserving parameter efficiency and reducing fine-tuning overhead, yielding better performance–efficiency trade-offs across heterogeneous vision-language benchmarks. All code and adapted models are publicly released.
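For context, the "variable-rank" structure refers to the standard LoRA update, in which a frozen pre-trained weight $W_0 \in \mathbb{R}^{d \times k}$ is augmented by a low-rank residual; the generic formulation below (standard LoRA notation, not taken from the paper) makes explicit that the rank $r$ becomes a per-module search variable rather than a global hyperparameter:

$$
W = W_0 + \frac{\alpha}{r}\,BA, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \in \{r_1, \dots, r_n\} \ \text{(selected per module by NAS)}.
$$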
📝 Abstract
Vision Language Models (VLMs) integrate visual and text modalities to enable multimodal understanding and generation. These models typically combine a Vision Transformer (ViT) as an image encoder with a Large Language Model (LLM) for text generation. LoRA (Low-Rank Adaptation) is an efficient fine-tuning method that adapts pre-trained models to new tasks by introducing low-rank updates to their weights. While LoRA has emerged as a powerful technique for fine-tuning large models, current implementations assume a fixed rank, potentially limiting flexibility and efficiency across diverse tasks. This paper introduces *LangVision-LoRA-NAS*, a novel framework that integrates Neural Architecture Search (NAS) with LoRA to optimize VLMs for variable-rank adaptation. Our approach leverages NAS to dynamically search for the optimal LoRA rank configuration tailored to specific multimodal tasks, balancing performance and computational efficiency. Through extensive experiments with the LLaMA-3.2-11B model on several datasets, LangVision-LoRA-NAS demonstrates notable improvements in model performance while reducing fine-tuning costs. Our base and searched fine-tuned models built on LLaMA-3.2-11B-Vision-Instruct can be found [here](https://huggingface.co/collections/krishnateja95/llama-32-11b-vision-instruct-langvision-lora-nas-6786cac480357a6a6fcc59ee), and the code for LangVision-LoRA-NAS can be found [here](https://github.com/krishnateja95/LangVision-NAS).
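To make the rank search concrete, below is a minimal sketch of one common way to implement a differentiable search over candidate LoRA ranks: each adapted linear layer carries one LoRA branch per candidate rank plus a softmax over architecture logits, in the style of DARTS. All names here (`VariableRankLoRALinear`, the candidate rank set, the `discretize` helper) are illustrative assumptions, not the released LangVision-NAS implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariableRankLoRALinear(nn.Module):
    """Frozen linear layer with a softmax-weighted mixture of candidate-rank
    LoRA branches (a DARTS-style relaxation; hypothetical sketch, not the
    paper's released code)."""

    def __init__(self, base: nn.Linear, candidate_ranks=(4, 8, 16, 32), alpha=16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.alpha = alpha
        self.candidate_ranks = candidate_ranks
        in_f, out_f = base.in_features, base.out_features
        # One (A, B) pair per candidate rank; B starts at zero so training
        # begins exactly at the pre-trained model, as in standard LoRA.
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(r, in_f) * 0.01) for r in candidate_ranks]
        )
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_f, r)) for r in candidate_ranks]
        )
        # Architecture logits: one scalar per candidate rank.
        self.arch_logits = nn.Parameter(torch.zeros(len(candidate_ranks)))

    def forward(self, x):
        out = self.base(x)
        weights = F.softmax(self.arch_logits, dim=0)
        for w, r, A, B in zip(weights, self.candidate_ranks, self.A, self.B):
            # Scaled low-rank residual: (alpha / r) * x A^T B^T, weighted by
            # the (relaxed) probability of choosing this rank.
            out = out + w * (self.alpha / r) * (x @ A.T @ B.T)
        return out

    def discretize(self):
        """After the search, keep only the highest-weighted candidate rank."""
        return self.candidate_ranks[int(self.arch_logits.argmax())]
```

In a typical bilevel recipe, the architecture logits are updated on a held-out split while the LoRA factors are trained on the task data; once the search converges, `discretize()` fixes one rank per module and the model is fine-tuned as an ordinary fixed-rank LoRA model.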