🤖 AI Summary
To address the inference overhead of serving multiple LoRA adapters concurrently on a shared base LLM—overhead caused by adapter switching and by inter-device communication when adapter weights are sharded—this paper proposes Block-LoRA. It constrains one LoRA factor to a block-diagonal structure, which naturally aligns with tensor-parallel device partitioning and thereby eliminates the cross-device synchronization overhead incurred by S-LoRA's sharding strategy. This design enables communication-free, load-balanced distributed LoRA computation while preserving LoRA's parameter efficiency. Combined with optimized memory management and model sharding strategies, Block-LoRA achieves end-to-end inference speedups of up to 1.79× on Llama-3.1-70B and up to 1.63× on Llama-3.1-8B when deployed across 8×A100 GPUs. Notably, these gains are attained without increasing (and in some configurations even reducing) the total number of adapter parameters.
📝 Abstract
When serving a single base LLM with several different LoRA adapters simultaneously, the adapters cannot simply be merged into the base model's weights: adapter swapping would create overhead, and requests using different adapters could not be batched. Rather, the LoRA computations have to be separated from the base LLM computations, and in a multi-device setup the LoRA adapters can be sharded in a way that is well aligned with the base model's tensor-parallel execution, as proposed in S-LoRA. However, the S-LoRA sharding strategy incurs some communication overhead, which may be small in theory but can be large in practice. In this paper, we propose to constrain certain LoRA factors to be block-diagonal, which allows for an alternative way of sharding LoRA adapters that requires no additional communication for the LoRA computations. We demonstrate in extensive experiments that our block-diagonal LoRA approach is similarly parameter-efficient to standard LoRA (i.e., for a similar number of parameters it achieves similar downstream performance) and that it leads to significant end-to-end speed-up over S-LoRA. For example, when serving on eight A100 GPUs, we observe up to 1.79x (1.23x) end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.
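To make the communication argument concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code) of the block-diagonal idea for a row-parallel layer, where each device holds only a slice of the input. With the down-projection factor A constrained block-diagonal, each device computes its LoRA partial output from its local input slice alone, and the partials are combined by the same sum/all-reduce the base model already performs—so the LoRA path adds no extra communication. All shapes and names below are hypothetical toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n_dev = 8, 6, 4, 2   # toy dims; rank and input dim divisible by n_dev
rb, db = r // n_dev, d_in // n_dev   # per-device rank block / input block

# Block-diagonal constraint: device i owns A_i (rb x db) and B_i (d_out x rb),
# so its LoRA path touches only its own input slice x_i.
A_blocks = [rng.standard_normal((rb, db)) for _ in range(n_dev)]
B_blocks = [rng.standard_normal((d_out, rb)) for _ in range(n_dev)]
x = rng.standard_normal(d_in)

# Reference: dense computation with the full block-diagonal A.
A_full = np.zeros((r, d_in))
for i, Ai in enumerate(A_blocks):
    A_full[i * rb:(i + 1) * rb, i * db:(i + 1) * db] = Ai
B_full = np.hstack(B_blocks)         # column blocks of B, one per device
y_ref = B_full @ (A_full @ x)

# "Sharded" computation: each device uses only its local input slice and
# produces a partial output; summing the partials is exactly the all-reduce
# the base row-parallel matmul needs anyway.
partials = [B_blocks[i] @ (A_blocks[i] @ x[i * db:(i + 1) * db])
            for i in range(n_dev)]
y_sharded = np.sum(partials, axis=0)

assert np.allclose(y_ref, y_sharded)  # identical result, no LoRA-specific comm
```

By contrast, with an unconstrained (dense) A, computing `A @ x` from sharded inputs would need an extra gather or reduce of the rank-r activations before applying B, which is the overhead the block-diagonal constraint removes.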