🤖 AI Summary
Existing SVD-based model compression methods reduce weight memory but overlook the substantial activation memory overhead induced by truncated singular vectors in dense CUDA kernels—a cost that escalates with sequence length and hidden dimension, leading to increased peak inference memory and hindering edge deployment. This work proposes the first end-to-end streaming inference framework tailored for SVD-compressed models. Our method integrates low-rank projection kernels, SRAM-efficient block-wise loading of truncated factors, on-chip computation, and immediate eviction—eliminating the need for full-sized activation caching. It requires no model architecture modifications and supports plug-and-play inference for arbitrary SVD-compressed models. Evaluated on BERT-Base, our approach reduces peak activation memory by 70.2% and transient memory by 75%, with zero accuracy degradation and no increase in inference latency.
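To make the overhead concrete, here is a minimal NumPy sketch of the problem the summary describes: applying truncated SVD factors in two dense steps materializes an intermediate activation buffer whose size grows with sequence length. The sizes below are illustrative assumptions, not figures from the paper.

```python
import numpy as np

# Hypothetical sizes for illustration (not taken from the paper):
seq_len, hidden, rank = 512, 768, 256

X = np.random.randn(seq_len, hidden).astype(np.float32)  # input activations
U = np.random.randn(hidden, rank).astype(np.float32)     # truncated left factor
V = np.random.randn(rank, hidden).astype(np.float32)     # truncated right factor

# Naive two-step application of the truncated factors materializes an
# intermediate activation of shape (seq_len, rank) on top of the
# (seq_len, hidden) output -- this extra buffer scales with seq_len,
# so SVD compression of the weights need not lower peak memory.
mid = X @ U          # extra activation buffer: seq_len * rank floats
Y = mid @ V          # final output, shape (seq_len, hidden)

extra_bytes = mid.nbytes  # 512 * 256 * 4 bytes for this configuration
```

Doubling `seq_len` doubles `extra_bytes`, which is the scaling behavior the summary attributes to dense kernels.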
📝 Abstract
Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for compressing large language models (LLMs), with a growing number of works demonstrating 20-80% parameter reductions at minimal accuracy loss. Previous SVD-based approaches have focused primarily on reducing the memory footprint of model weights, largely overlooking the additional activation memory overhead incurred during inference when the truncated factors are applied via standard dense CUDA kernels. Our experiments show that this activation overhead, which scales with sequence length and hidden dimension, prevents current SVD compression techniques from achieving any reduction in peak inference memory, limiting their viability for real-world, on-device deployment.
We introduce FlashSVD, a novel, end-to-end rank-aware streaming inference framework specifically designed for SVD-compressed large language models. FlashSVD can be seamlessly integrated with any model that employs SVD-based methods for parameter reduction. By fusing low-rank projection kernels directly into both the self-attention and feed-forward network (FFN) pipelines, FlashSVD avoids materializing full-size activation buffers. Instead, small tiles of the truncated factors are loaded into on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy and adding no extra latency. On standard encoder benchmarks (e.g., BERT-Base), FlashSVD cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%, while incurring no accuracy loss with upstream compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs.
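The tiling idea behind the streaming scheme can be sketched in NumPy: process the low-rank projection one row tile at a time, so that only a tile-sized intermediate exists at any moment instead of the full `(seq_len, rank)` buffer. This is a simplified host-side sketch of the load-compute-evict pattern, not the fused CUDA kernel itself; the function name and tile size are illustrative.

```python
import numpy as np

def streaming_lowrank_matmul(X, U, V, tile=64):
    """Compute Y = (X @ U) @ V one row tile at a time.

    At any moment only a (<= tile, rank) intermediate is alive,
    mimicking the tile-load / on-chip-compute / evict pattern,
    rather than the full (seq_len, rank) buffer of the dense path.
    """
    seq_len = X.shape[0]
    Y = np.empty((seq_len, V.shape[1]), dtype=X.dtype)
    for i in range(0, seq_len, tile):
        mid_tile = X[i:i + tile] @ U   # small tile-sized intermediate
        Y[i:i + tile] = mid_tile @ V   # consumed immediately, then freed
    return Y

# Sanity check against the dense two-step computation:
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 128)).astype(np.float32)
U = rng.standard_normal((128, 32)).astype(np.float32)
V = rng.standard_normal((32, 128)).astype(np.float32)
assert np.allclose(streaming_lowrank_matmul(X, U, V), (X @ U) @ V, atol=1e-4)
```

In the actual kernel the tiles live in SRAM and the two matmuls are fused, so the intermediate never touches global memory; the NumPy loop only illustrates why peak activation memory drops from `seq_len * rank` to `tile * rank`.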