AI Summary
Existing LoRA adapters incur substantial computational overhead and high latency during inference serving, hindering their practical deployment in vision tasks. This paper introduces VaLoRA, an end-to-end system for efficient inference optimization of large vision-language models (LVLMs). Our approach addresses three core challenges: (1) accuracy-aware automatic generation of LoRA adapters; (2) adaptive block-wise batching supporting heterogeneous adapters; and (3) a flexible adapter orchestration mechanism integrating request-level features and domain knowledge. VaLoRA holistically integrates LoRA fine-tuning, dynamic batching, and request-adapter co-scheduling. Evaluated across five representative vision tasks and three mainstream LVLMs, VaLoRA achieves average accuracy gains of 24-62% and end-to-end latency reductions of 20-89% over baseline methods.
Abstract
Large Multimodal Models (LMMs) have shown significant progress on various complex vision tasks, building on the solid linguistic and reasoning capacity inherited from large language models (LLMs). Low-rank adaptation (LoRA) offers a promising method to integrate external knowledge into LMMs, compensating for their limitations on domain-specific tasks. However, existing LoRA model serving is excessively computationally expensive and incurs extremely high latency. In this paper, we present an end-to-end solution that empowers diverse vision tasks and enriches vision applications with LoRA LMMs. Our system, VaLoRA, enables accurate and efficient vision tasks through 1) an accuracy-aware LoRA adapter generation approach that produces LoRA adapters rich in domain-specific knowledge to meet application-specific accuracy requirements, 2) an adaptive-tiling LoRA adapter batching operator that efficiently computes concurrent heterogeneous LoRA adapters, and 3) a flexible LoRA adapter orchestration mechanism that manages application requests and LoRA adapters to achieve the lowest average response latency. We prototype VaLoRA on five popular vision tasks across three LMMs. Experimental results show that VaLoRA improves accuracy by 24-62% over the original LMMs and reduces latency by 20-89% compared to state-of-the-art LoRA model serving systems.
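To make the batching challenge concrete, the sketch below illustrates (in plain NumPy, not VaLoRA's actual GPU operator) why serving concurrent heterogeneous LoRA adapters is awkward: the shared base projection `x @ W` batches trivially, but each request's low-rank delta `x @ A_i @ B_i` uses an adapter of a different rank, so a naive server falls back to a per-request loop. All names (`d_model`, `adapters`, `adapter_ids`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, batch = 16, 4

# Shared base weight, used by every request.
W = rng.standard_normal((d_model, d_model))

# Heterogeneous LoRA adapters: (A, B) pairs with differing ranks 2, 4, 8.
adapters = [(rng.standard_normal((d_model, r)) * 0.01,
             rng.standard_normal((r, d_model)) * 0.01)
            for r in (2, 4, 8)]

# Which adapter each request in the batch uses.
adapter_ids = [0, 2, 1, 0]

x = rng.standard_normal((batch, d_model))

# The base projection is computed once for the whole batch...
y = x @ W
# ...but the heterogeneous low-rank deltas cannot share one matmul,
# because A_i and B_i have different shapes per request.
for i, aid in enumerate(adapter_ids):
    A, B = adapters[aid]
    y[i] += x[i] @ A @ B

print(y.shape)  # (4, 16)
```

An adaptive-tiling batching operator, as the abstract describes, would replace this per-request loop with tiled kernels that group work across mismatched adapter ranks instead of serializing them.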