zFLoRA: Zero-Latency Fused Low-Rank Adapters

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant inference latency overhead (up to 2.5× that of the base model) introduced by low-rank adapters (e.g., LoRA) in multi-task deployment of large language models (LLMs), this paper proposes the zero-latency fused low-rank adapter (zFLoRA). zFLoRA eliminates inference-time adapter overhead by losslessly fusing the adapter parameters into the backbone weights after training. Evaluated on models ranging from 1B to 7B parameters, zFLoRA compares favorably against standard LoRA and full fine-tuning across 18 tasks spanning commonsense reasoning, math reasoning, and dialogue summarization. End-to-end latency measurements on NPU and GPU platforms show zero to negligible overhead, effectively eliminating adapter-induced inference cost without degrading accuracy.
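The fusion idea the summary describes can be illustrated with the standard LoRA merge, where the trained low-rank update is folded into the frozen base weight so inference runs a single dense matmul. This is a minimal generic sketch of that fusion step, not the paper's exact zFLoRA procedure; all names and dimensions here are illustrative.

```python
import numpy as np

# Illustrative LoRA weight fusion (generic, not the paper's exact method):
# after training, the low-rank update B @ A, scaled by alpha / r, is added
# into the frozen base weight W, so inference needs no separate adapter path.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # trained LoRA down-projection
B = rng.standard_normal((d_out, r))      # trained LoRA up-projection

x = rng.standard_normal(d_in)

# Unfused forward pass: base path plus low-rank adapter path
y_unfused = W @ x + (alpha / r) * (B @ (A @ x))

# Fused forward pass: fold the adapter into the base weight once, offline
W_fused = W + (alpha / r) * (B @ A)
y_fused = W_fused @ x

# The fusion is lossless: both paths give the same output up to float rounding
assert np.allclose(y_unfused, y_fused)
```

Because `W_fused` has the same shape as `W`, the fused model runs at exactly the base model's cost, which is the property the zero-latency claim rests on.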

📝 Abstract
Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with this apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant during inference (up to 2.5× that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B and 7B show that zFLoRA compares favorably against popular supervised fine-tuning benchmarks, including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three categories, namely commonsense reasoning, math reasoning and summary-dialogue. Latency measurements made on NPU (Samsung Galaxy S25+) and GPU (NVIDIA H100) platforms show that the proposed zFLoRA adapters introduce zero to negligible latency overhead.
Problem

Research questions and friction points this paper is trying to address.

Reducing latency overhead of task-specific adapters in LLMs
Eliminating computational cost from adapter parameters during inference
Maintaining performance while removing inference-time adapter delays
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-latency fused low-rank adapters for LLMs
Eliminates inference overhead via parameter fusion
Maintains performance across reasoning and dialogue tasks
Dhananjaya Gowda
Samsung Research, AI Center, Korea
AI · LLM · ASR · NLP · Speech processing
Seoha Song
Samsung Research
Harshith Goka
Samsung Research
Junhyun Lee
Samsung Research