🤖 AI Summary
This work shows that Cloud TPUs are far less cost-efficient than A100 GPUs (cuZK baseline) for finite-field cryptographic computation: the measured deficit is roughly 4,693× with v5p's native int32 accumulator and between 5,558× and 6,908× across v5p and v4 under FP32-mantissa staging. The authors decompose this deficit into an arithmetic penalty (no wide-integer ALUs) and a spatial penalty; the latter includes very low M-dimension utilization of the matrix compute units (6.25% on v4) and stalls from VPU-bound Montgomery reduction. As a measurement vehicle, they map low-degree polynomials onto matrix-form Number Theoretic Transforms (NTT) and stack them into dense 2D matrices. For concurrent multi-tenant deployments, where strict separation forces eager Montgomery reduction, they project a 5.19× “spatial collapse”. A post-hoc HLO validator keeps the measurements structurally isolated from XLA fusion optimizations. The study thus gives a quantitative characterization of TPUs' structural limitations in exact-domain computation and concludes that AI-optimized systolic arrays are ill-suited to high-throughput field arithmetic.
📝 Abstract
We empirically characterise the cost-efficiency deficit between cloud Tensor Processing Units (TPUs) and GPUs for finite-field cryptography. Against A100 GPU baselines (cuZK), we measure a deficit in the range $[5{,}558\times, 6{,}908\times]$ across the v5p and v4 architectures under an FP32-mantissa staging discipline, and a deficit of $\sim 4{,}693\times$ using v5p's native \texttt{int32} accumulator. We analytically decompose this deficit into a fundamental arithmetic penalty (the lack of wide-integer ALUs) and a spatial penalty. For concurrent multi-tenant deployments, where strict separation forces eager Montgomery reduction, we project a $5.19\times$ spatial collapse; relaxing this constraint theoretically recovers the lost spatial cycles, but the underlying arithmetic penalty remains. To facilitate this characterisation, we deploy \codename as a measurement vehicle. By mapping low-degree polynomials onto matrix-form Number Theoretic Transforms, the scheduler stacks heterogeneous polynomials into dense 2D matrices, achieving $\sim 100\%$ K-dimension column occupancy on uniform workloads ($>92\%$ on mixed-degree traces). Despite this near-optimal K-dimension packing, severe M-dimension under-utilisation (e.g., $6.25\%$ on v4), combined with stalls from VPU-bound Montgomery reduction, starves the systolic arrays. A post-hoc HLO validator ensures that these measurements remain structurally isolated from the XLA fusion engine. Our findings empirically demonstrate the structural inadequacy of AI-optimised systolic arrays for exact, high-throughput field arithmetic.
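To make the terms in the abstract concrete, the following is a minimal pure-Python sketch of a matrix-form NTT with Montgomery reduction. It is not the authors' implementation: the modulus `Q = 257`, the length `N = 8`, the radix `R = 2^16`, and every function name are assumptions chosen for readability. Polynomials of different degree are zero-padded and stacked as columns of one dense matrix, and the transform is a single matrix product modulo `Q`. The `eager` flag contrasts reducing after every multiply with accumulating raw products and reducing once.

```python
# Toy parameters chosen for readability (assumptions, not the paper's values).
Q = 257                            # prime modulus; N must divide Q - 1
N = 8                              # transform length (power of two)
R_BITS = 16
R = 1 << R_BITS                    # Montgomery radix, R > Q and gcd(R, Q) = 1
NEG_Q_INV = (-pow(Q, -1, R)) % R   # -Q^{-1} mod R (needs Python 3.8+)


def montgomery_reduce(t):
    """Return t * R^{-1} mod Q, valid for 0 <= t < Q * R."""
    m = ((t & (R - 1)) * NEG_Q_INV) & (R - 1)
    u = (t + m * Q) >> R_BITS      # exact: t + m*Q is a multiple of R
    return u - Q if u >= Q else u


def to_mont(a):
    """Map a to its Montgomery form a * R mod Q."""
    return (a * R) % Q


def root_of_unity(n, q):
    """Brute-force an element of exact order n (n a power of two) mod prime q."""
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)
        if pow(w, n // 2, q) != 1:
            return w
    raise ValueError("no root of unity of that order")


OMEGA = root_of_unity(N, Q)
# Dense N x N transform matrix W[i][k] = OMEGA^(i*k), stored in Montgomery form.
W_MONT = [[to_mont(pow(OMEGA, i * k, Q)) for k in range(N)] for i in range(N)]


def stack_columns(polys):
    """Zero-pad each coefficient list to length N and stack them as columns."""
    padded = [list(p) + [0] * (N - len(p)) for p in polys]
    return [[padded[j][k] for j in range(len(polys))] for k in range(N)]


def ntt_batch(x, eager=True):
    """Compute Y = W @ X mod Q for an N x B matrix X of plain residues."""
    cols = len(x[0])
    x_mont = [[to_mont(v) for v in row] for row in x]
    y = [[0] * cols for _ in range(N)]
    for i in range(N):
        for j in range(cols):
            if eager:
                # Reduce after every multiply (N reductions per output).
                acc = 0
                for k in range(N):
                    prod = W_MONT[i][k] * x_mont[k][j]
                    acc = (acc + montgomery_reduce(prod)) % Q
            else:
                # Accumulate raw products and reduce once.
                # Safe here because N * Q^2 < Q * R.
                total = sum(W_MONT[i][k] * x_mont[k][j] for k in range(N))
                acc = montgomery_reduce(total)
            y[i][j] = montgomery_reduce(acc)   # convert out of Montgomery form
    return y


def ntt_reference(x):
    """Direct evaluation with Python big integers, used only as a check."""
    cols = len(x[0])
    return [[sum(pow(OMEGA, i * k, Q) * x[k][j] for k in range(N)) % Q
             for j in range(cols)] for i in range(N)]


# Two polynomials of different degree share one dense matrix.
X = stack_columns([[1, 2, 3], [5, 0, 0, 0, 7]])
Y_eager = ntt_batch(X, eager=True)
Y_lazy = ntt_batch(X, eager=False)
```

Both paths produce the same transform as `ntt_reference`. The lazy path performs one reduction per output element instead of `N`, which is a toy analogue of the relaxation the abstract describes. The arithmetic inside `montgomery_reduce` is unchanged in either case, and on hardware without wide-integer ALUs the intermediate products would additionally need to be split into narrower limbs, which this sketch does not model.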