Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the overhead of serving multiple LoRA adapters concurrently on a shared base large language model—arising from adapter switching and from inter-device communication when adapter weights are sharded—this paper proposes Block-LoRA. It constrains certain LoRA factors to a block-diagonal structure that naturally aligns with tensor-parallel device partitioning, thereby eliminating the cross-device synchronization inherent in S-LoRA's sharding strategy. This design enables communication-free, load-balanced distributed LoRA computation while preserving LoRA's parameter efficiency. Combined with optimized memory management and model-sharding strategies, Block-LoRA achieves end-to-end inference speedups of up to 1.79× on Llama-3.1-70B and 1.63× on Llama-3.1-8B when serving on eight A100 GPUs. Notably, these gains are attained without increasing (and sometimes even reducing) the total number of adapter parameters.
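As a rough illustration of why the block-diagonal constraint need not inflate adapter size, consider the parameter count of a single adapted layer. This is a back-of-envelope sketch with hypothetical dimensions (not figures from the paper), assuming the up-projection factor B is the one constrained to k diagonal blocks:

```python
# Hypothetical layer dimensions, LoRA rank, and tensor-parallel degree;
# these numbers are illustrative, not taken from the paper.
d_in, d_out, r, k = 4096, 4096, 16, 8

# Standard LoRA: both factors A (r x d_in) and B (d_out x r) are dense.
dense_params = d_in * r + d_out * r

# Block-diagonal LoRA: B holds k blocks of shape (d_out/k) x (r/k),
# shrinking it from d_out*r to d_out*r/k parameters.
block_params = d_in * r + (d_out // k) * (r // k) * k

print(dense_params, block_params)  # 131072 vs 73728 for these dimensions
```

Because the constrained factor uses fewer parameters at the same rank, the rank can be raised to match a dense-LoRA parameter budget, which is consistent with the paper's reported 0.87×/1.74× adapter-parameter ratios.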

📝 Abstract
When serving a single base LLM with several different LoRA adapters simultaneously, the adapters cannot simply be merged with the base model's weights as the adapter swapping would create overhead and requests using different adapters could not be batched. Rather, the LoRA computations have to be separated from the base LLM computations, and in a multi-device setup the LoRA adapters can be sharded in a way that is well aligned with the base model's tensor parallel execution, as proposed in S-LoRA. However, the S-LoRA sharding strategy encounters some communication overhead, which may be small in theory, but can be large in practice. In this paper, we propose to constrain certain LoRA factors to be block-diagonal, which allows for an alternative way of sharding LoRA adapters that does not require any additional communication for the LoRA computations. We demonstrate in extensive experiments that our block-diagonal LoRA approach is similarly parameter efficient as standard LoRA (i.e., for a similar number of parameters it achieves similar downstream performance) and that it leads to significant end-to-end speed-up over S-LoRA. For example, when serving on eight A100 GPUs, we observe up to 1.79x (1.23x) end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.
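The equivalence that makes the sharding communication-free can be seen in a toy simulation. The sketch below is our own illustration under assumed shapes (names like `A`, `B_blocks`, and the choice of constraining the up-projection factor are for exposition, not taken from the paper): with B block-diagonal, each tensor-parallel "device" holds one block of B and the matching rows of A, and computes its slice of the LoRA output from the replicated input alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, k = 16, 16, 8, 2   # layer dims, LoRA rank, tensor-parallel degree

A = rng.standard_normal((r, d_in))                              # dense down-projection
B_blocks = [rng.standard_normal((d_out // k, r // k)) for _ in range(k)]

x = rng.standard_normal((4, d_in))  # batch of activations, replicated on all devices

# Each "device" i computes its output slice purely locally: it only needs
# its r/k rows of A and its own block B_i -- no cross-device collective.
local_outputs = []
for i in range(k):
    A_i = A[i * (r // k):(i + 1) * (r // k), :]
    local_outputs.append((x @ A_i.T) @ B_blocks[i].T)
y_sharded = np.concatenate(local_outputs, axis=1)

# Reference: the same computation with the assembled block-diagonal B.
B_full = np.zeros((d_out, r))
for i in range(k):
    B_full[i * (d_out // k):(i + 1) * (d_out // k),
           i * (r // k):(i + 1) * (r // k)] = B_blocks[i]
y_dense = x @ A.T @ B_full.T

assert np.allclose(y_sharded, y_dense)  # shard outputs match the full product
```

The concatenated per-device slices equal the unsharded product exactly, mirroring how the block structure lines up with a column-parallel base layer so that no all-gather or all-reduce is needed for the LoRA path.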
Problem

Research questions and friction points this paper is trying to address.

Eliminating communication overhead in tensor parallel LoRA serving
Enabling efficient batching for requests using different LoRA adapters
Maintaining parameter efficiency while accelerating multi-GPU LoRA serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block-diagonal LoRA eliminates communication overhead
Sharding adapters aligned with tensor parallel execution
Maintains parameter efficiency while accelerating serving speed
Xinyu Wang
University of Warwick, Coventry, UK
Jonas M. Kübler
Amazon Web Services
Kailash Budhathoki
Amazon Web Services, Tübingen, Germany
Yida Wang
Amazon Web Services, Santa Clara, USA
Matthäus Kleindessner
Amazon Web Services, Tübingen, Germany
Machine learning