FlashSketch: Sketch-Kernel Co-Design for Fast Sparse Sketching on GPUs

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of sparse sketching on GPUs, where irregular memory accesses caused by random sparsity severely limit bandwidth utilization and computational throughput. To overcome this, the authors propose a co-designed approach combining a novel BlockPerm-SJLT sparse structure with a customized FlashSketch CUDA kernel, yielding the first GPU-optimized sparse sketching method that preserves the theoretical guarantees of Oblivious Subspace Embedding. The design introduces tunable parameters to explicitly balance efficiency and accuracy. Experimental results demonstrate that the proposed method achieves a 1.7× geometric mean speedup over existing GPU sketching schemes on RandNLA benchmarks and GraSS data attribution tasks, while significantly advancing the Pareto frontier between speed and accuracy.

📝 Abstract
Sparse sketches such as the sparse Johnson-Lindenstrauss transform are a core primitive in randomized numerical linear algebra because they leverage random sparsity to reduce the arithmetic cost of sketching, while still offering strong approximation guarantees. Their random sparsity, however, is at odds with efficient implementations on modern GPUs, since it leads to irregular memory access patterns that degrade memory bandwidth utilization. Motivated by this tension, we pursue a sketch-kernel co-design approach: we design a new family of sparse sketches, BlockPerm-SJLT, whose sparsity structure is chosen to enable FlashSketch, a corresponding optimized CUDA kernel that implements these sketches efficiently. The design of BlockPerm-SJLT introduces a tunable parameter that explicitly trades off GPU efficiency against sketching robustness. We provide theoretical guarantees for BlockPerm-SJLT under the oblivious subspace embedding (OSE) framework, and also analyze the effect of the tunable parameter on sketching quality. We empirically evaluate FlashSketch on standard RandNLA benchmarks, as well as an end-to-end ML data attribution pipeline called GraSS. FlashSketch pushes the Pareto frontier of sketching quality versus speed across a range of regimes and tasks, and achieves a global geometric-mean speedup of roughly 1.7x over the prior state-of-the-art GPU sketches.
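For background on the primitive the paper builds on: a generic sparse JL transform (SJLT) places s random ±1/√s entries in each column of the sketching matrix, so applying it costs O(s·nnz) instead of a dense matrix multiply. The sketch below is a minimal NumPy illustration of that idea only; it does not implement BlockPerm-SJLT or the FlashSketch kernel, whose structured layouts are the paper's contribution.

```python
import numpy as np

def sjlt(m, n, s, rng):
    """Dense representation of a generic sparse JL transform:
    each of the n columns has exactly s nonzeros, valued +/-1/sqrt(s)."""
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)  # random row positions
        signs = rng.choice([-1.0, 1.0], size=s)      # Rademacher signs
        S[rows, j] = signs / np.sqrt(s)
    return S

rng = np.random.default_rng(0)
n, d, m, s = 2000, 10, 200, 4        # tall matrix A, sketch dim m, sparsity s
A = rng.standard_normal((n, d))
SA = sjlt(m, n, s, rng) @ A          # sketched matrix, shape (m, d)

# Column norms are preserved up to small distortion (the JL property).
err = abs(np.linalg.norm(SA[:, 0]) / np.linalg.norm(A[:, 0]) - 1.0)
```

The random row positions in each column are exactly the "irregular memory access" the abstract refers to: on a GPU they produce scattered reads/writes, which is the bottleneck BlockPerm-SJLT's structured sparsity is designed to remove.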
Problem

Research questions and friction points this paper is trying to address.

sparse sketching
GPU efficiency
irregular memory access
randomized numerical linear algebra
memory bandwidth
Innovation

Methods, ideas, or system contributions that make the work stand out.

sketch-kernel co-design
BlockPerm-SJLT
GPU-efficient sketching
sparse Johnson-Lindenstrauss transform
oblivious subspace embedding
Rajat Vadiraj Dwaraknath
Institute for Computational and Mathematical Engineering (ICME), Stanford University
Sungyoon Kim
Department of Electrical Engineering, Stanford University
Mert Pilanci
Stanford University
Machine Learning · Optimization · Neural Networks · Signal Processing · Information Theory