Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels

📅 2026-04-20


🤖 AI Summary

This study targets the avoidable DRAM traffic of the conventional three-stage matrix-free operator (gather, batched GEMM, scatter) used in three-dimensional SIMP topology optimization. The matrix-free pattern already removes global stiffness assembly, but the three-stage implementation writes intermediates to global memory between stages. The authors fuse gather, per-element stiffness multiplication, and scatter accumulation into a single CUDA kernel, compiled at runtime through CuPy, so the operator runs in one pass. On a single RTX 4090 (24 GB), the fused path gives end-to-end SIMP wall-time speedups of 4.6–7.3× for cantilever problems from 216k to 4.9M elements and 4.4× on a 499,125-element torsion benchmark; against the same-precision FP32 three-stage baseline the gains are 2.3–4.6× and 2.8×, respectively. Instrumented board-power traces at 216k and 1M elements show 3.2–4.9× lower energy than matched FP64 runs. The work also evaluates a BF16 WMMA variant: operator condition-number estimates of 6.1e5–2.3e6 are too large for BF16 precision, and BF16 iterative refinement stagnated at the two tested inner tolerances.
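To make the gather, GEMM, and scatter stages concrete, here is a minimal NumPy sketch of the unfused three-stage operator on a tiny structured hexahedral grid. It is not the paper's CUDA kernel. The helper names, the modified-SIMP parameters (`p`, `E0`, `Emin`), and the random symmetric `Ke` (a placeholder, not a real hexahedral element stiffness) are all assumptions made for illustration.

```python
import numpy as np

def hex_connectivity(nx, ny, nz):
    """(n_elem, 8) node indices for a structured nx*ny*nz hexahedral grid."""
    node = lambda i, j, k: (i * (ny + 1) + j) * (nz + 1) + k
    conn = [[node(i + di, j + dj, k + dk)
             for di in (0, 1) for dj in (0, 1) for dk in (0, 1)]
            for i in range(nx) for j in range(ny) for k in range(nz)]
    return np.array(conn)

def simp_modulus(rho, p=3.0, E0=1.0, Emin=1e-9):
    """Modified SIMP interpolation (assumed form): Emin + rho^p * (E0 - Emin)."""
    return Emin + rho ** p * (E0 - Emin)

def matvec_three_stage(u, dofs, Ke, scale):
    """y = K(rho) @ u without assembling K, as three separate stages."""
    ue = u[dofs]                        # stage 1, gather: (n_elem, 24)
    fe = scale[:, None] * (ue @ Ke.T)   # stage 2, batched GEMM + SIMP scaling
    y = np.zeros_like(u)
    np.add.at(y, dofs, fe)              # stage 3, scatter-add into global vector
    return y

def assemble_dense(n_dof, dofs, Ke, scale):
    """Reference only: explicit global assembly, used to check the matvec."""
    K = np.zeros((n_dof, n_dof))
    for e, d in enumerate(dofs):
        K[np.ix_(d, d)] += scale[e] * Ke
    return K

rng = np.random.default_rng(0)
nx, ny, nz = 3, 2, 2
conn = hex_connectivity(nx, ny, nz)
# Three displacement DOFs per node -> 24 DOFs per element.
dofs = (3 * conn[:, :, None] + np.arange(3)).reshape(len(conn), 24)
n_dof = 3 * (nx + 1) * (ny + 1) * (nz + 1)

A = rng.standard_normal((24, 24))
Ke = A @ A.T                            # placeholder symmetric element matrix
rho = rng.uniform(0.1, 1.0, len(conn))  # hypothetical element densities
scale = simp_modulus(rho)
u = rng.standard_normal(n_dof)

y = matvec_three_stage(u, dofs, Ke, scale)
y_ref = assemble_dense(n_dof, dofs, Ke, scale) @ u
print("max abs difference vs. assembled K:", np.max(np.abs(y - y_ref)))
```

In this sketch the arrays `ue` and `fe` are fully materialized between stages. That is the kind of intermediate traffic the summary describes the fused kernel as avoiding by keeping per-element data on chip within a single pass.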

📝 Abstract

The matrix-free gather-batched-GEMM-scatter pattern eliminates global stiffness assembly for three-dimensional SIMP topology optimization, but the conventional three-stage implementation forces avoidable DRAM traffic between stages. We present a single fused CUDA kernel, implemented through CuPy's runtime compilation interface, that performs gather, per-element stiffness multiplication, and scatter accumulation in one pass. On a single RTX 4090 (24 GB), the fused path reaches a problem-size-dependent 4.6-7.3x end-to-end SIMP wall-time speedup across 216k-4.9M cantilever elements and 4.4x on the 499,125-element torsion benchmark. Against the same-precision FP32 three-stage baseline, the fused path still yields 2.3-4.6x on cantilever and 2.8x on torsion. Isolated CUDA-event cantilever-operator measurements reach 8.9-13.8x per matvec call, while separate instrumented board-power traces at 216k and 1M show 3.2-4.9x lower energy than matched FP64 runs. A separate bridge stress test shows the same FP32-versus-FP64 three-stage trend under one distributed-load case; direct fused-kernel bridge benchmarks are not reported. We also evaluate a BF16 WMMA variant. A separate PyTorch BF16 GEMM proxy on matching tensor shapes yields 14.3x, but direct condition-number estimates of 6.1e5-2.3e6 across 64k-512k uniform-density test states imply BF16 conditioning products of 2.4e3-9.1e3, far above the 256 threshold. This is observed alongside BF16 iterative-refinement stagnation at the two tested inner tolerances.
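The BF16 conditioning figures can be checked with a short calculation. The sketch below assumes the conditioning product is the condition number times the BF16 unit roundoff, taken as 2^-8 (8 significand bits, round-to-nearest). Under that assumption a product below 1 corresponds to a condition number below 256, which is one plausible reading of the "256 threshold"; the abstract does not spell out its definition. The function and constant names are hypothetical.

```python
# Assumption: BF16 unit roundoff u = 2^-8 (8 significand bits, round-to-nearest).
BF16_UNIT_ROUNDOFF = 2.0 ** -8

def conditioning_product(kappa, u=BF16_UNIT_ROUNDOFF):
    """Condition number times unit roundoff (assumed definition)."""
    return kappa * u

# Endpoints of the condition-number range quoted in the abstract.
for kappa in (6.1e5, 2.3e6):
    prod = conditioning_product(kappa)
    print(f"kappa = {kappa:.1e}  kappa*u = {prod:.2e}  "
          f"kappa/256 = {kappa / 256:.2e}")
```

With the rounded endpoints from the abstract, this gives products of roughly 2.4e3 and 9.0e3, which matches the reported 2.4e3-9.1e3 range up to rounding of the published condition numbers.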

Problem

Research questions and friction points this paper is trying to address.

topology optimization

matrix-free

SIMP

DRAM traffic

Innovation

Methods, ideas, or system contributions that make the work stand out.

matrix-free

fused kernel

topology optimization

SIMP