AI Summary
To address the GPU performance bottleneck in Euclidean distance computation for high-dimensional similarity search, this paper proposes FaSTED, the first algorithm to leverage mixed-precision (FP16/FP32) tensor cores for exact Euclidean distance calculation. FaSTED jointly optimizes three key aspects: (i) reformulating distance computation as matrix multiplication amenable to tensor-core acceleration; (ii) hierarchical memory reuse across global memory, shared memory, and registers; and (iii) data movement orchestration to maximize tensor-core throughput and memory bandwidth utilization. Evaluated on real-world high-dimensional datasets, FaSTED achieves a 2.5-51x speedup over state-of-the-art baselines, with negligible accuracy loss (<0.06%), significantly outperforming FP32 and FP64 implementations. This work establishes a novel paradigm for extending tensor cores beyond traditional AI workloads to fundamental non-AI operators, demonstrating both practical efficiency gains and architectural generalizability.
Abstract
Modern GPUs are equipped with tensor cores (TCs) that are commonly used for matrix multiplication in artificial intelligence workloads. However, because TCs have very high computational throughput, they can yield significant performance gains in other algorithms if they can be successfully exploited. We examine using TCs to compute Euclidean distance calculations, which are used in many data analytics applications. Prior work has only investigated using 64-bit floating-point (FP64) data for computation; however, TCs can operate on lower-precision floating-point data (i.e., 16-bit matrix multiplication with 32-bit accumulation), which we refer to as FP16-32. FP16-32 TC peak throughput is so high that TCs are easily starved of data. We propose the Fast and Scalable Tensor core Euclidean Distance (FaSTED) algorithm. To achieve high computational throughput, we design FaSTED for significant hierarchical reuse of data and maximize memory utilization at every level (global memory, shared memory, and registers). We apply FaSTED to the application of similarity searches, which typically employ an indexing data structure to eliminate superfluous Euclidean distance calculations. We compare against the state-of-the-art (SOTA) TC Euclidean distance algorithm in the literature, which employs FP64, as well as against two single-precision (FP32) CUDA core algorithms that both employ an index. We find that across four real-world high-dimensional datasets spanning 128-960 dimensions, the mixed-precision brute-force approach achieves a speedup over the SOTA algorithms of 2.5-51x. We also quantify the accuracy loss of our mixed-precision algorithm to be less than 0.06% compared to the FP64 baseline.
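The reformulation that makes Euclidean distance amenable to tensor cores is the standard identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b): the cross term over all point pairs is exactly a matrix multiplication. Below is a minimal NumPy sketch of this reformulation for illustration only; it is not the FaSTED kernel, and the function name `pairwise_sq_dists` is our own.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B,
    computed via the matmul identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2(a.b)."""
    a_norms = np.sum(A * A, axis=1)[:, None]  # ||a_i||^2, shape (m, 1)
    b_norms = np.sum(B * B, axis=1)[None, :]  # ||b_j||^2, shape (1, n)
    cross = A @ B.T  # all pairwise dot products: the tensor-core-friendly part
    # Clamp tiny negatives from floating-point cancellation.
    return np.maximum(a_norms + b_norms - 2.0 * cross, 0.0)

# Sanity check against the direct definition.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 128))
B = rng.standard_normal((5, 128))
brute = np.array([[np.sum((a - b) ** 2) for b in B] for a in A])
assert np.allclose(pairwise_sq_dists(A, B), brute)
```

On a GPU, the `A @ B.T` term dominates the cost and maps directly onto tensor-core matrix-multiply instructions; in the FP16-32 setting described above, the inputs to that multiply are FP16 while the accumulation runs in FP32.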