FlashSketch: Sketch-Kernel Co-Design for Fast Sparse Sketching on GPUs

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of sparse sketching on GPUs, where irregular memory accesses caused by random sparsity severely limit bandwidth utilization and computational throughput. To overcome this, the authors propose a co-designed approach combining a novel BlockPerm-SJLT sparse structure with a customized FlashSketch CUDA kernel, yielding the first GPU-optimized sparse sketching method that preserves the theoretical guarantees of Oblivious Subspace Embedding. The design introduces tunable parameters to explicitly balance efficiency and accuracy. Experimental results demonstrate that the proposed method achieves a 1.7× geometric mean speedup over existing GPU sketching schemes on RandNLA benchmarks and GraSS data attribution tasks, while significantly advancing the Pareto frontier between speed and accuracy.

📝 Abstract
Sparse sketches such as the sparse Johnson-Lindenstrauss transform are a core primitive in randomized numerical linear algebra because they leverage random sparsity to reduce the arithmetic cost of sketching, while still offering strong approximation guarantees. Their random sparsity, however, is at odds with efficient implementations on modern GPUs, since it leads to irregular memory access patterns that degrade memory bandwidth utilization. Motivated by this tension, we pursue a sketch-kernel co-design approach: we design a new family of sparse sketches, BlockPerm-SJLT, whose sparsity structure is chosen to enable FlashSketch, a corresponding optimized CUDA kernel that implements these sketches efficiently. The design of BlockPerm-SJLT introduces a tunable parameter that explicitly trades off GPU efficiency against sketching robustness. We provide theoretical guarantees for BlockPerm-SJLT under the oblivious subspace embedding (OSE) framework, and also analyze the effect of the tunable parameter on sketching quality. We empirically evaluate FlashSketch on standard RandNLA benchmarks, as well as an end-to-end ML data attribution pipeline called GraSS. FlashSketch pushes the Pareto frontier of sketching quality versus speed across a range of regimes and tasks, and achieves a global geometric-mean speedup of roughly 1.7x over the prior state-of-the-art GPU sketches.
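For background on the primitive the paper builds on: a generic sparse JL transform (SJLT) places s random ±1/√s entries in each column of the sketching matrix, so applying it costs O(s·nnz) instead of a dense matrix multiply. The sketch below is a minimal NumPy illustration of that idea only; it does not implement BlockPerm-SJLT or the FlashSketch kernel, whose structured layouts are the paper's contribution.

```python
import numpy as np

def sjlt(m, n, s, rng):
    """Dense representation of a generic sparse JL transform:
    each of the n columns has exactly s nonzeros, valued +/-1/sqrt(s)."""
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)  # random row positions
        signs = rng.choice([-1.0, 1.0], size=s)      # Rademacher signs
        S[rows, j] = signs / np.sqrt(s)
    return S

rng = np.random.default_rng(0)
n, d, m, s = 2000, 10, 200, 4        # tall matrix A, sketch dim m, sparsity s
A = rng.standard_normal((n, d))
SA = sjlt(m, n, s, rng) @ A          # sketched matrix, shape (m, d)

# Column norms are preserved up to small distortion (the JL property).
err = abs(np.linalg.norm(SA[:, 0]) / np.linalg.norm(A[:, 0]) - 1.0)
```

The random row positions in each column are exactly the "irregular memory access" the abstract refers to: on a GPU they produce scattered reads/writes, which is the bottleneck BlockPerm-SJLT's structured sparsity is designed to remove.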
Problem

Research questions and friction points this paper is trying to address.

sparse sketching
GPU efficiency
irregular memory access
randomized numerical linear algebra
memory bandwidth
Innovation

Methods, ideas, or system contributions that make the work stand out.

sketch-kernel co-design
BlockPerm-SJLT
GPU-efficient sketching
sparse Johnson-Lindenstrauss transform
oblivious subspace embedding
Rajat Vadiraj Dwaraknath
Institute for Computational and Mathematical Engineering (ICME), Stanford University
Sungyoon Kim
Department of Electrical Engineering, Stanford University
Mert Pilanci
Stanford University
Machine Learning · Optimization · Neural Networks · Signal Processing · Information Theory