🤖 AI Summary
To address the high GPU memory overhead of LoRA fine-tuning for large language models (LLMs) and the severe accuracy degradation that arises under ultra-low-bit quantization (below 2 bits), this paper proposes LowRA, the first framework to enable end-to-end 1.15-bit LoRA training. Methodologically, the authors introduce fine-grained quantization mapping, dynamic threshold selection, and adaptive bit-width allocation, backed by custom CUDA kernels for efficient training. The key contribution is breaking the sub-2-bit accuracy bottleneck of LoRA, yielding the first practical ultra-low-bit LoRA fine-tuning without significant performance loss. Extensive experiments across four mainstream LLMs and four benchmark datasets show that the method reduces GPU memory consumption by up to 50% relative to standard LoRA while incurring an average accuracy drop of under 1% at 1.15 bits, substantially improving training efficiency and scalability.
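To make the idea of adaptive bit-width allocation concrete, below is a minimal, hypothetical sketch (not the LowRA implementation or its API): frozen base-weight groups are quantized to either a 1-bit or a 2-bit codebook, and the groups with the largest 1-bit reconstruction error are promoted to 2 bits until the average bit-width reaches a target such as 1.15. The function names, group size, and promotion rule are all assumptions made purely for illustration.

```python
import torch

def quantize_group(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Quantize a 1-D weight group to a tiny codebook (illustrative only)."""
    if bits == 1:
        # 1-bit: keep only the sign, scaled by the group's mean magnitude.
        return torch.sign(w) * w.abs().mean()
    # 2-bit: four levels placed at quantiles of the group's empirical distribution.
    levels = torch.quantile(w, torch.tensor([0.125, 0.375, 0.625, 0.875]))
    idx = (w.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

def mixed_precision_quantize(weight: torch.Tensor,
                             group_size: int = 64,
                             avg_bits_target: float = 1.15) -> torch.Tensor:
    """Promote the groups that suffer most at 1 bit to 2 bits, so the
    average bit-width across the tensor lands near `avg_bits_target`."""
    flat = weight.reshape(-1, group_size)
    q1 = torch.stack([quantize_group(g, 1) for g in flat])
    err_1bit = ((flat - q1) ** 2).sum(dim=1)   # per-group damage at 1 bit
    frac_2bit = avg_bits_target - 1.0           # average = 1*(1-f) + 2*f = 1 + f
    n_2bit = int(frac_2bit * flat.shape[0])
    promoted = set(torch.topk(err_1bit, n_2bit).indices.tolist())
    out = torch.stack([
        quantize_group(g, 2) if i in promoted else q1[i]
        for i, g in enumerate(flat)
    ])
    return out.reshape(weight.shape)

# Usage: quantize a frozen base-weight matrix before attaching LoRA adapters.
W = torch.randn(256, 256)
W_q = mixed_precision_quantize(W)
print(f"MSE at ~1.15 bits: {(W - W_q).pow(2).mean():.4f}")
```

The actual system additionally optimizes the quantization mapping and thresholds themselves and runs the quantize/dequantize steps in custom CUDA kernels; this sketch only shows why a mixture of 1-bit and 2-bit groups yields a fractional average bit-width.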
📝 Abstract
Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization (mapping, threshold selection, and precision assignment) while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.