LLM Zeroth-Order Fine-Tuning is an Inference Workload

📅 2026-05-27
🤖 AI Summary
This work addresses the mismatch between zeroth-order (ZO) fine-tuning methods for large language models (e.g., MeZO, LoZO) and the conventional training loops that usually run them, arguing that ZO fine-tuning is an inference-dominated workload. The authors execute the repeated scoring phase through a serving runtime (vLLM) and represent ZO updates as dynamic adapter states. On OPT-13B SST-2, the 20k-step LoZO run completes in an estimated 0.51 training hours versus 4.15 hours for the official LoZO baseline in the matched LoRA-only setting, an 8.13× speedup, while reaching 0.922 final evaluation accuracy (0.931 on full validation). Core-step scaling experiments from OPT-1.3B to OPT-13B show 2.34× to 7.72× speedups, and a MeZO-style high-rank factorized experiment runs up to 2.55× faster while tracking a MeZO-like loss trajectory.
📝 Abstract
Zeroth-order (ZO) fine-tuning is attractive for large language models because it replaces backpropagation with forward objective evaluations. Existing implementations nevertheless execute ZO algorithms inside conventional training loops, even though their dominant work is repeated scoring under nearby parameter states. This creates a workload-runtime mismatch: the algorithm asks for structured inference-style scoring, while the system exposes a sequence of fragmented training-loop steps. We show that LLM ZO fine-tuning is an inference-dominated workload and execute its repeated scoring phase through a serving runtime. On OPT-13B SST-2, the resulting vLLM execution path completes the 20k-step LoZO run in 0.51 estimated training hours versus 4.15 hours for the official LoZO baseline under the matched LoRA-only setting, an 8.13x speedup, while reaching 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. In core-step scaling experiments across OPT-1.3B to OPT-13B, the same runtime reorganization gives 2.34x to 7.72x speedups. A MeZO-style high-rank factorized experiment shows that the same runtime paradigm can track a MeZO-like loss trajectory while running up to 2.55x faster. More broadly, representing ZO updates as dynamic adapter states suggests a practical path toward inference-time training, where lightweight adaptation can be scheduled as an inference-like workload rather than as a separate training job.
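
To make "forward objective evaluations" and "repeated scoring under nearby parameter states" concrete, here is a minimal sketch of a generic two-point (SPSA-style) zeroth-order step. It is an illustration only, not the paper's runtime or the MeZO/LoZO code: the names `zo_step` and `quadratic` are hypothetical, and a toy quadratic in NumPy stands in for the model's loss.

```python
import numpy as np

def zo_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One two-point zeroth-order update that uses only forward evaluations."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)        # random perturbation direction
    loss_plus = loss_fn(params + eps * z)        # score at one nearby state
    loss_minus = loss_fn(params - eps * z)       # score at the mirrored state
    g_scalar = (loss_plus - loss_minus) / (2 * eps)  # directional-derivative estimate
    return params - lr * g_scalar * z            # move along the sampled direction

def quadratic(w):
    """Toy objective (assumption): squared distance to the all-ones vector."""
    return float(np.sum((w - 1.0) ** 2))

w = np.zeros(8)
initial = quadratic(w)
for step in range(500):
    w = zo_step(w, quadratic, seed=step)         # a fresh direction per step
final = quadratic(w)
print(f"loss: {initial:.3f} -> {final:.6f}")
```

Each step consists of two scoring calls at perturbed parameter states plus a scalar-weighted update, with no backward pass. That is the sense in which the dominant work resembles inference-style scoring.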
Problem

Research questions and friction points this paper is trying to address.

Zeroth-Order Optimization
Large Language Models
Fine-Tuning
Inference Workload
Training-System Mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth-Order Optimization
Inference-Dominated Workload
vLLM
Inference-Time Training
LoRA