Empower Vision Applications with LoRA LMM

📅 2024-11-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing LoRA model serving incurs substantial computational overhead and high inference latency, hindering practical deployment in vision tasks. This paper introduces VaLoRA, an end-to-end system for efficient LoRA-based serving of large vision-language models (LVLMs). The approach addresses three core challenges: (1) accuracy-aware automatic generation of LoRA adapters; (2) adaptive tiling-based batching of concurrent heterogeneous adapters; and (3) a flexible adapter orchestration mechanism integrating request-level features and domain knowledge. VaLoRA holistically integrates LoRA fine-tuning, dynamic batching, and request–adapter co-scheduling. Evaluated across five representative vision tasks and three mainstream LVLMs, VaLoRA improves accuracy by 24–62% over the original models and reduces end-to-end latency by 20–89% over baseline serving methods.

๐Ÿ“ Abstract
Large Multimodal Models (LMMs) have shown significant progress in various complex vision tasks with the solid linguistic and reasoning capacity inherited from large language models (LLMs). Low-rank adaptation (LoRA) offers a promising method to integrate external knowledge into LMMs, compensating for their limitations on domain-specific tasks. However, existing LoRA model serving is excessively computationally expensive and causes extremely high latency. In this paper, we present an end-to-end solution that empowers diverse vision tasks and enriches vision applications with LoRA LMMs. Our system, VaLoRA, enables accurate and efficient vision tasks by 1) an accuracy-aware LoRA adapter generation approach that generates LoRA adapters rich in domain-specific knowledge to meet application-specific accuracy requirements, 2) an adaptive-tiling LoRA adapter batching operator that efficiently computes concurrent heterogeneous LoRA adapters, and 3) a flexible LoRA adapter orchestration mechanism that manages application requests and LoRA adapters to achieve the lowest average response latency. We prototype VaLoRA on five popular vision tasks on three LMMs. Experiment results reveal that VaLoRA improves accuracy by 24–62% compared to the original LMMs and reduces latency by 20–89% compared to state-of-the-art LoRA model serving systems.
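To make the batching problem concrete, the sketch below shows the naive gather-and-scatter way to serve a batch in which different requests use different LoRA adapters of different ranks: the shared base weight is applied once for the whole batch, and each adapter's low-rank update (x · A_i · B_i) is applied only to its own requests. This is a minimal illustration under assumed shapes, not the paper's fused adaptive-tiling operator; all names (`batched_lora_forward`, `matmul`, etc.) are hypothetical.

```python
def matmul(X, Y):
    # naive dense matmul: X is (n, k), Y is (k, m), nested lists of floats
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def batched_lora_forward(x, W, adapters, assignment):
    """Hypothetical sketch: y = x W + (x A_i) B_i per request.

    x          : (batch, d_in) activations, one row per request
    W          : (d_in, d_out) shared base weight
    adapters   : list of (A, B) pairs; A is (d_in, r_i), B is (r_i, d_out),
                 and ranks r_i may differ across adapters (heterogeneous)
    assignment : adapter index for each row of x
    """
    y = matmul(x, W)  # base-model path, computed once for the whole batch
    for i, (A, B) in enumerate(adapters):
        rows = [j for j, a in enumerate(assignment) if a == i]
        if not rows:
            continue
        xi = [x[r] for r in rows]            # gather this adapter's requests
        delta = matmul(matmul(xi, A), B)     # low-rank update at rank r_i
        for r, d in zip(rows, delta):        # scatter updates back into batch
            y[r] = [yv + dv for yv, dv in zip(y[r], d)]
    return y
```

A production serving system would fuse the gather, the two small matmuls, and the scatter into one GPU kernel whose tile sizes adapt to each adapter's rank; the sketch only shows why heterogeneous ranks break straightforward batching, since each adapter needs its own intermediate of width r_i.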
Problem

Research questions and friction points this paper is trying to address.

Enhance vision tasks with domain-specific LoRA adapters
Reduce computational cost and latency in LoRA model serving
Improve accuracy and efficiency in multimodal vision applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Accuracy-aware LoRA adapter generation for domain-specific tasks
Adaptive-tiling LoRA adapters batching for efficient computation
Flexible LoRA adapter orchestration to minimize response latency
Authors

Liang Mi
State Key Laboratory for Novel Software Technology, Nanjing University
Weijun Wang
Tsinghua University
LLM Serving System, Edge AI, Video Analytics System
Wenming Tu
Institute for AI Industry Research (AIR), Tsinghua University
Qingfeng He
Institute for AI Industry Research (AIR), Tsinghua University
Rui Kong
Institute for AI Industry Research (AIR), Tsinghua University
Xinyu Fang
Institute for AI Industry Research (AIR), Tsinghua University
Yazhu Dong
Institute for AI Industry Research (AIR), Tsinghua University
Yikang Zhang
State Key Laboratory for Novel Software Technology, Nanjing University
Yuanchun Li
Institute for AI Industry Research (AIR), Tsinghua University
Mobile computing, artificial intelligence
Meng Li
State Key Laboratory for Novel Software Technology, Nanjing University
Haipeng Dai
Nanjing University
Wireless sensor networks, wireless power transfer
Guihai Chen
Professor of Computer Science, Computer Science and Technology
Yunxin Liu
IEEE Fellow, Guoqiang Professor, Institute for AI Industry Research (AIR), Tsinghua University
Mobile Computing, Edge Computing, AIoT, System, Networking