FLoRA: Fused forward-backward adapters for parameter efficient fine-tuning and reducing inference-time latencies of LLMs

๐Ÿ“… 2025-10-28
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the trade-off between accuracy and inference latency in parameter-efficient fine-tuning (PEFT) of large language models (LLMs), this paper proposes fused forward–backward adapters (FFBA), a family of adapter architectures that integrate LoRA with a parallel adapter structure within Transformer projection layers. FFBA uses low-rank decomposition for parameter-efficient updates and adds lightweight adapter modules along both the forward and backward paths, jointly improving training stability and inference efficiency. Under identical parameter budgets, FFBA consistently outperforms standard LoRA across multiple downstream tasks: it achieves average accuracy gains of 2.1–4.7 percentage points, reduces first-token latency by 18%–32%, and lowers total inference latency by 12%–25%. Notably, this work extends adapter design to the backward path, co-optimizing accuracy and efficiency in PEFT.

๐Ÿ“ Abstract
As large language models (LLMs) grow in size, efficient training and fine-tuning have never been more important. This has generated great interest in parameter-efficient fine-tuning (PEFT), and effective methods such as low-rank adapters (LoRA) have emerged. Although various PEFT methods have been studied extensively in recent years, much of the design space remains unexplored given its many degrees of freedom. In this paper, we propose FLoRA, a family of fused forward-backward adapters (FFBA) for parameter-efficient fine-tuning of LLMs on downstream tasks. FFBA combines ideas from the popular LoRA and parallel adapters to improve overall fine-tuning accuracy. At the same time, latencies are minimized by fusing the forward and backward adapters into the existing projection layers of the base model. Experimental results show that the proposed FFB adapters perform significantly better than the widely used LoRA in both accuracy and latency for a similar parameter budget.
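Since the abstract's latency claim rests on fusing adapters into existing projection layers, the core idea can be sketched for the simplest case, a linear LoRA-style branch. This is a minimal NumPy illustration with assumed dimensions and scaling (not values from the paper): after training, the low-rank update folds into the frozen weight, so the adapted layer costs a single matmul at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 16, 2  # hidden size and low rank (illustrative values, r << d)

# Frozen base projection weight and a toy input vector.
W = rng.standard_normal((d, d))
x = rng.standard_normal(d)

# LoRA branch: low-rank factors B @ A are trained while W stays frozen.
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((d, r)) * 0.01

# Training-time forward pass: base path plus the low-rank adapter path.
y_train = W @ x + B @ (A @ x)

# Inference-time fusion: fold the adapter into the projection weight,
# so the adapted layer runs as one matmul with no added latency.
W_fused = W + B @ A
y_infer = W_fused @ x

assert np.allclose(y_train, y_infer)
```

The same folding works for any adapter branch that is linear in the layer input; a nonlinear parallel adapter cannot be merged this way, which is presumably part of what the paper's fused forward-backward design addresses.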
Problem

Research questions and friction points this paper is trying to address.

Reducing parameter count during LLM fine-tuning
Minimizing inference latency in large language models
Improving accuracy while maintaining parameter efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fused forward-backward adapters for parameter-efficient fine-tuning
Combines LoRA and parallel adapters to improve accuracy
Minimizes latency by fusing adapters into projection layers
๐Ÿ”Ž Similar Papers
No similar papers found.
Dhananjaya Gowda
Samsung Research, AI Center, Korea
AI · LLM · ASR · NLP · Speech processing
Seoha Song
Samsung Research
Junhyun Lee
Samsung Research
Harshith Goka
Samsung Research