FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
Although singular value decomposition (SVD)-based compression reduces the parameter count and nominal computational cost of Transformers, it often fails to deliver practical inference speedups: factorized checkpoints fragment the execution path, and the resulting overhead differs between the prefill and autoregressive decoding phases. This work proposes a unified low-rank Transformer inference runtime that maps diverse SVD-compressed models into a common factorized representation and optimizes execution through phase-specific kernels, dense-KV decoding, packed MLP computation, and per-layer CUDA graph replay. The runtime gives multiple SVD-based compression methods a single serving path, and the results suggest that compression algorithms need to be co-designed with the inference runtime. Across representative decoder-serving settings, it achieves up to 2.55× faster decoding and up to 2.39× end-to-end speedup, with average speedups of 1.48× in decoding and 1.44× end to end across multiple SVD compression families.
📝 Abstract
SVD-based low-rank compression reduces transformer parameters and nominal FLOPs, but these savings often translate poorly into real LLM serving speedups. We show that this gap is largely a runtime problem: factorized checkpoints fragment execution paths, and the resulting overhead differs substantially between prefill and autoregressive decode. We present FlashSVD v1.5, a unified inference runtime for serving SVD-compressed transformers. FlashSVD v1.5 maps diverse public SVD compression families to a common factorized representation and combines phase-specific kernels with dense-KV decode, packed MLP execution, and per-layer CUDA-graph replay to reorganize the low-rank serving path into a thin runtime. Across representative decoder-serving settings, FlashSVD v1.5 achieves up to 2.55x decode and 2.39x end-to-end speedup, and it attains 1.48x average decode and 1.44x average end-to-end speedup across multiple popular SVD compression families. These results suggest that practical low-rank acceleration requires runtime co-design, not compression algorithms alone. Our code is available at: https://github.com/Zishan-Shao/FlashSVD.
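The gap between nominal savings and real speedup described above can be made concrete with a little arithmetic. The sketch below is not FlashSVD code: the function names and the layer sizes are hypothetical, chosen only for illustration. It counts parameters, multiply-accumulates (MACs), and matrix multiplications for a dense linear layer y = x @ W versus a rank-r factorized layer y = (x @ A) @ B.

```python
# Illustrative cost model only; names and sizes are assumptions, not from the paper.

def dense_cost(d_in: int, d_out: int, tokens: int) -> dict:
    """Cost of y = x @ W, where W has shape (d_in, d_out)."""
    return {
        "params": d_in * d_out,
        "macs": tokens * d_in * d_out,
        "gemms": 1,  # a single matrix multiplication
    }


def factorized_cost(d_in: int, d_out: int, rank: int, tokens: int) -> dict:
    """Cost of y = (x @ A) @ B, where A is (d_in, rank) and B is (rank, d_out)."""
    per_token = rank * (d_in + d_out)
    return {
        "params": per_token,
        "macs": tokens * per_token,
        "gemms": 2,  # two smaller multiplications plus an intermediate tensor
    }


def break_even_rank(d_in: int, d_out: int) -> int:
    """Largest rank at which the factorized layer has strictly fewer parameters."""
    # rank * (d_in + d_out) < d_in * d_out
    return (d_in * d_out - 1) // (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 1024  # hypothetical layer and rank
    for tokens, phase in [(1, "decode"), (2048, "prefill")]:
        dense = dense_cost(d_in, d_out, tokens)
        low_rank = factorized_cost(d_in, d_out, rank, tokens)
        ratio = dense["macs"] / low_rank["macs"]
        print(f"{phase}: nominal MAC reduction {ratio:.2f}x, "
              f"GEMMs {dense['gemms']} -> {low_rank['gemms']}")
    print("break-even rank:", break_even_rank(d_in, d_out))
```

In this model the nominal MAC reduction is the same in both phases, but the factorized layer always issues two multiplications instead of one. With a single token per decode step, each multiplication is small, so the extra launch and the intermediate tensor can plausibly take up a larger share of the step time than they do in prefill. That is one way to read the abstract's point that the overhead differs between phases and has to be addressed in the runtime.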
Problem

Research questions and friction points this paper is trying to address.

low-rank compression
transformer inference
SVD
LLM serving
runtime overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-rank compression
SVD
inference runtime
CUDA graph
transformer acceleration