AI Summary
Existing LoRA adapters incur substantial computational overhead and high latency during inference serving, hindering their practical deployment in vision tasks. This paper introduces VaLoRA, an end-to-end system for efficient inference optimization of large vision-language models (LVLMs). Our approach addresses three core challenges: (1) accuracy-aware automatic generation of LoRA adapters; (2) adaptive block-wise batching supporting heterogeneous adapters; and (3) a flexible adapter orchestration mechanism integrating request-level features and domain knowledge. VaLoRA holistically integrates LoRA fine-tuning, dynamic batching, and request-adapter co-scheduling. Evaluated across five representative vision tasks and three mainstream LVLMs, VaLoRA achieves average accuracy gains of 24-62% and end-to-end latency reductions of 20-89% over baseline methods.
Abstract
Large Multimodal Models (LMMs) have shown significant progress on various complex vision tasks, building on the solid linguistic and reasoning capacity inherited from large language models (LLMs). Low-rank adaptation (LoRA) offers a promising method to integrate external knowledge into LMMs, compensating for their limitations on domain-specific tasks. However, existing LoRA model serving is excessively computationally expensive and incurs extremely high latency. In this paper, we present an end-to-end solution that empowers diverse vision tasks and enriches vision applications with LoRA LMMs. Our system, VaLoRA, enables accurate and efficient vision tasks through 1) an accuracy-aware LoRA adapter generation approach that produces LoRA adapters rich in domain-specific knowledge to meet application-specific accuracy requirements, 2) an adaptive-tiling LoRA adapter batching operator that efficiently computes concurrent heterogeneous LoRA adapters, and 3) a flexible LoRA adapter orchestration mechanism that manages application requests and LoRA adapters to achieve the lowest average response latency. We prototype VaLoRA on five popular vision tasks across three LMMs. Experimental results show that VaLoRA improves accuracy by 24-62% over the original LMMs and reduces latency by 20-89% compared to state-of-the-art LoRA model serving systems.
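To make the batching challenge concrete, the sketch below illustrates (in plain NumPy, not VaLoRA's actual GPU operator) why serving concurrent heterogeneous LoRA adapters is awkward: the shared base projection `x @ W` batches trivially, but each request's low-rank delta `x @ A_i @ B_i` uses an adapter of a different rank, so a naive server falls back to a per-request loop. All names (`d_model`, `adapters`, `adapter_ids`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, batch = 16, 4

# Shared base weight, used by every request.
W = rng.standard_normal((d_model, d_model))

# Heterogeneous LoRA adapters: (A, B) pairs with differing ranks 2, 4, 8.
adapters = [(rng.standard_normal((d_model, r)) * 0.01,
             rng.standard_normal((r, d_model)) * 0.01)
            for r in (2, 4, 8)]

# Which adapter each request in the batch uses.
adapter_ids = [0, 2, 1, 0]

x = rng.standard_normal((batch, d_model))

# The base projection is computed once for the whole batch...
y = x @ W
# ...but the heterogeneous low-rank deltas cannot share one matmul,
# because A_i and B_i have different shapes per request.
for i, aid in enumerate(adapter_ids):
    A, B = adapters[aid]
    y[i] += x[i] @ A @ B

print(y.shape)  # (4, 16)
```

An adaptive-tiling batching operator, as the abstract describes, would replace this per-request loop with tiled kernels that group work across mismatched adapter ranks instead of serializing them.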