Fine-tuning Quantized Neural Networks with Zeroth-order Optimization

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address severe GPU memory constraints in fine-tuning large language models (LLMs) and diffusion models, this paper proposes Quantized Zeroth-Order optimization (QZO), an ultra-low-memory fine-tuning framework that relies solely on forward passes—eliminating the need to store gradients or optimizer states. QZO estimates directional derivatives by perturbing continuous quantization scaling factors (rather than discrete weights) and incorporates directional derivative clipping to ensure training stability. It is orthogonal to and compatible with existing post-training quantization schemes. Under 4-bit weight quantization (int4/bfloat16 mixed precision), QZO reduces total GPU memory consumption by over 18× compared to bfloat16 full-parameter fine-tuning. This enables, for the first time, efficient 4-bit fine-tuning of Llama-2-13B and Stable Diffusion 3.5 Large on a single 24GB GPU.
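The summary above describes a MeZO-style two-point gradient estimate applied to the continuous quantization scales, plus clipping of the estimated directional derivative. A minimal sketch of that idea follows; the function name, hyperparameters, and the exact clipping rule are illustrative assumptions, not the paper's precise algorithm. The discrete int4 weights are never touched; only the scales are perturbed, and each step needs just two forward passes.

```python
import numpy as np

def qzo_step(scales, loss_fn, eps=1e-3, lr=1e-5, clip=10.0, seed=0):
    """One zeroth-order update on continuous quantization scales.

    Illustrative sketch: discrete quantized weights stay frozen;
    only `scales` are perturbed, so no backward pass is needed.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(scales.shape)   # random perturbation direction

    loss_plus = loss_fn(scales + eps * z)   # forward pass 1
    loss_minus = loss_fn(scales - eps * z)  # forward pass 2

    # Two-point (central-difference) estimate of the directional
    # derivative of the loss along z.
    d = (loss_plus - loss_minus) / (2 * eps)

    # Directional derivative clipping, used to stabilize training.
    d = float(np.clip(d, -clip, clip))

    # The gradient estimate is d * z; apply a plain SGD step.
    return scales - lr * d * z
```

Because the update uses only forward evaluations, there is nothing to store beyond the model itself: no gradients and no optimizer states, which is where the memory savings come from.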

📝 Abstract
As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize memory usage on weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible due to the precision gap between discrete weights and continuous gradients, which would otherwise require de-quantization and re-quantization. To overcome this challenge, we propose Quantized Zeroth-order Optimization (QZO), a novel approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO is orthogonal to both scalar-based and codebook-based post-training quantization methods. Compared to full-parameter fine-tuning in bfloat16, QZO can reduce the total memory cost by more than 18× for 4-bit LLMs, and enables fine-tuning Llama-2-13B and Stable Diffusion 3.5 Large within a single 24GB GPU.
Problem

Research questions and friction points this paper is trying to address.

Reduce GPU memory usage for fine-tuning large language models
Enable zeroth-order optimization with quantized neural networks
Overcome precision gap between discrete weights and continuous gradients
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth-order optimization for gradient approximation
Quantized weights using int4 to save memory
Perturbs quantization scale for gradient estimation
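The bullets above hinge on the split between discrete int4 weight codes and a continuous scale. A small sketch of symmetric scalar quantization makes this concrete; the per-tensor scheme and helper names are illustrative assumptions, since the paper also supports codebook-based quantizers.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor int4 quantization (illustrative scheme):
    w is approximated as scale * q, with codes q in [-8, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate weights from codes and the scale."""
    return scale * q.astype(np.float32)

w = np.array([0.7, -0.35, 0.14, -0.07], dtype=np.float32)
q, scale = quantize_int4(w)
# q holds frozen discrete codes; only the continuous `scale` would be
# perturbed and updated during QZO fine-tuning.
```

The key point is that `q` has no useful continuous gradient, but `scale` does, which is why perturbing the scale sidesteps the de-quantization/re-quantization problem noted in the abstract.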
Sifeng Shang
Hong Kong Baptist University
Jiayi Zhou
Hong Kong Baptist University
Chenyu Lin
Hong Kong Baptist University
Minxian Li
Nanjing University of Science and Technology
Kaiyang Zhou
Assistant Professor, Hong Kong Baptist University
Machine Learning · Computer Vision · Artificial Intelligence