CDM-QTA: Quantized Training Acceleration for Efficient LoRA Fine-Tuning of Diffusion Model

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high power consumption, low efficiency, and memory bottlenecks of on-device LoRA fine-tuning for large diffusion models, this work proposes the first training acceleration architecture designed specifically for fully quantized LoRA fine-tuning. The architecture introduces a high-utilization dataflow that supports the irregular tensor shapes arising in low-rank adaptation, combining a fully quantized training scheme, a LoRA-specific dataflow, and hardware-software co-optimization. Experiments demonstrate a 1.81× training speedup and a 5.50× energy efficiency improvement over the baseline implementation, with negligible degradation in image generation quality (ΔFID < 0.3). The core contribution is the end-to-end integration of fully quantized training with LoRA-aware hardware acceleration, establishing a scalable path toward efficient diffusion model fine-tuning on edge devices.

📝 Abstract
Fine-tuning large diffusion models for custom applications demands substantial power and time, which poses significant challenges for efficient implementation on mobile devices. In this paper, we develop a novel training accelerator specifically for Low-Rank Adaptation (LoRA) of diffusion models, aiming to streamline the process and reduce computational complexity. By leveraging a fully quantized training scheme for LoRA fine-tuning, we achieve substantial reductions in memory usage and power consumption while maintaining high model fidelity. The proposed accelerator features flexible dataflow, enabling high utilization for irregular and variable tensor shapes during the LoRA process. Experimental results show up to 1.81x training speedup and 5.50x energy efficiency improvements compared to the baseline, with minimal impact on image generation quality.
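The parameter savings that make LoRA attractive for on-device training can be sketched in a few lines. This is an illustrative example of low-rank adaptation in general, not the paper's implementation; the layer size, rank, and scaling factor below are assumptions chosen for demonstration.

```python
import numpy as np

# Minimal LoRA sketch: a frozen weight W is adapted via a low-rank
# update B @ A, so only r * (d_in + d_out) parameters are trained
# instead of the full d_in * d_out.
rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8              # hypothetical layer size and LoRA rank

W = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection (zero-init)
alpha = 16.0                              # LoRA scaling hyperparameter

def lora_forward(x):
    # Base path plus scaled low-rank path; because B starts at zero,
    # the adapted model initially matches the pretrained one exactly.
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4f}")
```

The narrow `r`-dimension of `A` and `B` is also what produces the "irregular and variable tensor shapes" the abstract says the accelerator's dataflow must keep well utilized.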
Problem

Research questions and friction points this paper is trying to address.

Accelerate LoRA fine-tuning for diffusion models
Reduce memory and power usage in training
Maintain model fidelity with quantized training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully quantized training scheme for LoRA fine-tuning
Flexible dataflow sustaining high utilization on irregular tensor shapes
Dedicated accelerator that reduces memory usage and power consumption
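To make the quantized-training idea concrete, the sketch below shows symmetric INT8 "fake quantization", one common way fully quantized training is emulated in software. The bit width, per-tensor scaling, and rounding mode here are assumptions; the paper's hardware presumably operates on real integer tensors.

```python
import numpy as np

def quantize_int8(x):
    # Per-tensor symmetric quantization to signed 8-bit integers.
    scale = np.max(np.abs(x)) / 127.0 + 1e-12    # map max magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127)  # integer representation
    return q.astype(np.int8), scale

def dequantize(q, scale):
    # Recover an approximate float tensor from the integer codes.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 512)).astype(np.float32)  # e.g. a LoRA matrix

qA, sA = quantize_int8(A)
A_hat = dequantize(qA, sA)
err = np.max(np.abs(A - A_hat))
print(f"max quantization error: {err:.4f} (bounded by scale/2 = {sA / 2:.4f})")
```

Because the rounding error per element is at most half a quantization step, the reconstruction error stays bounded by `scale / 2`, which is consistent with the reported minimal impact on generation quality.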