AI Summary
To address the GPU performance bottleneck in Euclidean distance computation for high-dimensional similarity search, this paper proposes FaSTED, the first algorithm to leverage mixed-precision (FP16/FP32) tensor cores for exact Euclidean distance calculation. FaSTED jointly optimizes three key aspects: (i) reformulating distance computation as matrix multiplication amenable to tensor-core acceleration; (ii) hierarchical memory reuse across global memory, shared memory, and registers; and (iii) data movement orchestration to maximize tensor-core throughput and memory bandwidth utilization. Evaluated on real-world high-dimensional datasets, FaSTED achieves a 2.5-51x speedup over state-of-the-art baselines, with negligible accuracy loss (<0.06%), significantly outperforming FP32 and FP64 implementations. This work establishes a novel paradigm for extending tensor cores beyond traditional AI workloads to fundamental non-AI operators, demonstrating both practical efficiency gains and architectural generalizability.
Abstract
Modern GPUs are equipped with tensor cores (TCs) that are commonly used for matrix multiplication in artificial intelligence workloads. However, because TCs have very high computational throughput, they can yield significant performance gains in other algorithms if they can be successfully exploited. We examine using TCs to compute Euclidean distance calculations, which are used in many data analytics applications. Prior work has only investigated using 64-bit floating-point (FP64) data for computation; however, TCs can operate on lower-precision floating-point data (i.e., 16-bit matrix multiplication with 32-bit accumulation), which we refer to as FP16-32. FP16-32 TC peak throughput is so high that TCs are easily starved of data. We propose the Fast and Scalable Tensor core Euclidean Distance (FaSTED) algorithm. To achieve high computational throughput, we design FaSTED for significant hierarchical reuse of data and maximize memory utilization at every level (global memory, shared memory, and registers). We apply FaSTED to the application of similarity searches, which typically employ an indexing data structure to eliminate superfluous Euclidean distance calculations. We compare against the state-of-the-art (SOTA) TC Euclidean distance algorithm in the literature, which employs FP64, as well as against two single-precision (FP32) CUDA core algorithms that both employ an index. We find that across four real-world high-dimensional datasets spanning 128-960 dimensions, the mixed-precision brute-force approach achieves a speedup over the SOTA algorithms of 2.5-51x. We also quantify the accuracy loss of our mixed-precision algorithm to be less than 0.06% compared to the FP64 baseline.
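The reformulation that makes Euclidean distance amenable to tensor cores is the standard identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b): the cross term over all point pairs is exactly a matrix multiplication. Below is a minimal NumPy sketch of this reformulation for illustration only; it is not the FaSTED kernel, and the function name `pairwise_sq_dists` is our own.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B,
    computed via the matmul identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2(a.b)."""
    a_norms = np.sum(A * A, axis=1)[:, None]  # ||a_i||^2, shape (m, 1)
    b_norms = np.sum(B * B, axis=1)[None, :]  # ||b_j||^2, shape (1, n)
    cross = A @ B.T  # all pairwise dot products: the tensor-core-friendly part
    # Clamp tiny negatives from floating-point cancellation.
    return np.maximum(a_norms + b_norms - 2.0 * cross, 0.0)

# Sanity check against the direct definition.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 128))
B = rng.standard_normal((5, 128))
brute = np.array([[np.sum((a - b) ** 2) for b in B] for a in A])
assert np.allclose(pairwise_sq_dists(A, B), brute)
```

On a GPU, the `A @ B.T` term dominates the cost and maps directly onto tensor-core matrix-multiply instructions; in the FP16-32 setting described above, the inputs to that multiply are FP16 while the accumulation runs in FP32.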