PreFT: Prefill-only finetuning for efficient inference

📅 2026-05-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the throughput degradation that multi-user personalized parameter-efficient fine-tuning (PEFT) services suffer when adapters are applied during the decoding phase. The authors propose applying adapters only during the prefill stage and disabling them during decoding. By confining adapter computation to prefill and integrating this approach with the vLLM inference engine, the method enables efficient LoRA and ReFT execution, further enhanced by memory optimization and kernel scheduling techniques. Experiments show 1.9× the throughput of standard PEFT serving on Llama-3.1-70B when 512 adapters are served simultaneously. On reinforcement learning tasks, performance approaches parity with standard PEFT; on supervised fine-tuning tasks, evaluation loss is higher but can be recovered by increasing rank, with almost no reduction in throughput.
📝 Abstract
Large language models can now be personalised efficiently at scale using parameter-efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than that of PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
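The core idea in the abstract, applying the adapter to prefill tokens and dropping it for decode tokens, can be illustrated with a toy sketch. This is not the authors' vLLM implementation: the helper names (`matvec`, `lora_linear`, `preft_project`), the pure-Python list arithmetic, and the toy weights are all assumptions made for illustration. It shows a single LoRA-style linear layer where the low-rank update `B @ (A @ x)` is added only for the first `num_prefill` token positions.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_linear(x, W, A, B, use_adapter):
    """Base projection W @ x, plus the low-rank update B @ (A @ x) if enabled."""
    y = matvec(W, x)
    if use_adapter:
        delta = matvec(B, matvec(A, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y


def preft_project(tokens, num_prefill, W, A, B):
    """Hypothetical helper: adapter on for prefill positions, off for decode."""
    return [
        lora_linear(x, W, A, B, use_adapter=(i < num_prefill))
        for i, x in enumerate(tokens)
    ]


# Toy values, chosen only for illustration.
W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (identity)
A = [[1.0, 1.0]]              # LoRA down-projection, rank 1
B = [[0.5], [0.5]]            # LoRA up-projection
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 2 prefill tokens, 1 decode token

outputs = preft_project(tokens, num_prefill=2, W=W, A=A, B=B)
print(outputs)  # first two rows include the LoRA update; the last is base-only
```

In this sketch the decode token passes through the base weights alone, so the decode step needs no per-user adapter computation. In a real transformer, the adapter would presumably still influence generation through the adapted prefill states that decode tokens attend to; that part is outside what this single-layer toy models.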
Problem

Research questions and friction points this paper is trying to address.

efficient inference
parameter efficient finetuning
throughput
multi-adapter serving
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefill-only Finetuning
PEFT
throughput optimization
multi-adapter serving
vLLM