MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration

📅 2025-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Accelerating general matrix-vector multiplication (GeMV) for low-bit large language model (LLM) inference on unmodified DRAM faces significant overhead from input pre-reordering and bit-transposition, hindering efficient in-memory computing. Method: This paper proposes the first hardware-modification-free, DRAM-native GeMV acceleration method. Leveraging GeMV’s data reuse and linear algebraic structure, it employs system-level co-scheduling, dataflow restructuring, and quantization-aware mapping to eliminate the input reordering and output transposition overhead inherent in conventional Processing-Using-DRAM (PUD) approaches. Contribution/Results: On a DDR4 platform, the method achieves up to 7.29× latency reduction and 30.5× energy efficiency improvement for GeMV. For end-to-end 2-bit and 4-bit LLM inference, it delivers 2.18× and 1.31× throughput gains, respectively, along with 3.04× and 2.35× energy efficiency improvements.

📝 Abstract
General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the potential to repurpose on-device DRAM as a GeMV engine, offering additional high-throughput processing capabilities to widespread consumer devices without DRAM modifications. However, applying PUD to GeMV operations in the LLM inference pipeline incurs significant overheads before and after in-DRAM computation, diminishing the benefits of its high-throughput processing capabilities. This paper presents MVDRAM, the first practical system to accelerate GeMV operations for low-bit LLM inference using unmodified DRAM. By leveraging the data sharing patterns and mathematical linearity in GeMV operations, MVDRAM orchestrates the processor and DRAM to eliminate the costs associated with pre-arranging inputs and bit-transposition of outputs required in conventional PUD approaches. Our experimental evaluation with four DDR4 DRAM modules shows that MVDRAM achieves comparable or even better inference speed than the processor-based implementation for GeMV operations in low-bit (under 4-bit) LLM. In particular, MVDRAM achieves up to 7.29× speedup and 30.5× energy efficiency for low-bit GeMV operations. For end-to-end LLM inference, MVDRAM achieves 2.18× and 1.31× throughput improvements, along with 3.04× and 2.35× energy efficiency, for 2-bit and 4-bit quantized low-bit models, respectively. MVDRAM has the potential to redefine the AI hardware landscape by demonstrating the feasibility of standard DRAM as an LLM accelerator.
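The "mathematical linearity" the abstract refers to can be illustrated with a bit-plane decomposition of a low-bit GeMV: a b-bit weight matrix W satisfies W·x = Σₖ 2ᵏ(Bₖ·x), where each Bₖ is a 0/1 bit-plane, so the product reduces to per-bit partial GeMVs that map naturally onto bulk bitwise in-DRAM operations. The sketch below is a minimal illustration of this identity only, not MVDRAM's actual kernel or scheduling:

```python
# Illustration (not MVDRAM's implementation): bit-plane decomposition of a
# low-bit quantized GeMV, showing the linearity W.x = sum_k 2^k (B_k . x).

def gemv(W, x):
    """Plain integer matrix-vector product."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def bitplane_gemv(W, x, bits):
    """Compute W.x by summing per-bit-plane partial GeMVs.

    W holds unsigned integers of at most `bits` bits each.
    """
    result = [0] * len(W)
    for k in range(bits):
        # Extract bit-plane k: a 0/1 matrix (one bulk bitwise op per plane).
        B_k = [[(w >> k) & 1 for w in row] for row in W]
        partial = gemv(B_k, x)
        # Shift-and-accumulate the partial result with weight 2^k.
        result = [r + (p << k) for r, p in zip(result, partial)]
    return result

# 2-bit weights (values 0..3), as in the paper's 2-bit LLM setting.
W = [[3, 1, 0], [2, 0, 1]]
x = [5, 7, 9]
assert bitplane_gemv(W, x, bits=2) == gemv(W, x)  # both give [22, 19]
```

Because each bit-plane product is independent, the host processor can combine the partial results with cheap shifts and adds, which is the kind of processor/DRAM division of labor the paper's co-scheduling exploits.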
Problem

Research questions and friction points this paper is trying to address.

Accelerating GeMV operations in low-bit LLM inference
Reducing overheads in Processing-Using-DRAM for GeMV
Enabling unmodified DRAM as efficient LLM accelerator
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses unmodified DRAM for GeMV acceleration
Eliminates pre-arranging and bit-transposition overheads
Leverages data sharing and mathematical linearity
Tatsuya Kubo
The University of Tokyo
Daichi Tokuda
The University of Tokyo
Tomoya Nagatani
The University of Tokyo
Masayuki Usui
The University of Tokyo
Lei Qu
Microsoft Research
Ting Cao
Microsoft Research
Shinya Takamaeda-Yamazaki
The University of Tokyo
Computer Architecture, High-Level Synthesis Compiler, FPGA System, Algorithm/Hardware Co-design, Machine Learning Acceleration