🤖 AI Summary
To address the low computational efficiency and high memory pressure of Multi-Head Attention (MHA) when training large Transformer models on the NVIDIA Volta architecture (e.g., the V100), this work proposes a fine-grained kernel-fusion and dynamic shared-memory scheduling strategy tailored to Volta Tensor Cores. The method combines low-level CUDA optimizations, FP16/INT8 mixed-precision arithmetic, attention computation-graph rewriting, and Tensor Core (TCU)-aware scheduling, achieving performance gains without accuracy degradation. Experiments show a 3.2× improvement in MHA throughput and a 67% reduction in end-to-end inference latency. Notably, the authors report real-time inference for a 13B-parameter model on a single V100 GPU, the first such result on this hardware, while markedly improving GPU memory utilization and training scalability.
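To make the mixed-precision claim concrete, here is a minimal NumPy sketch of the arithmetic pattern Volta Tensor Cores use for attention: FP16 storage for Q/K/V with FP32 accumulation in the matrix products, which is what lets the kernel avoid accuracy degradation. This is only an illustration of the numerics under assumed shapes (single head, `(seq_len, head_dim)`); the function name is hypothetical and this is not the authors' fused CUDA kernel.

```python
import numpy as np

def fused_attention_fp16(q, k, v):
    """Single-head attention with FP16 storage and FP32 accumulation.

    Mirrors the Tensor Core convention: operands live in FP16, but
    partial sums accumulate in FP32. In a real fused kernel, `scores`
    and `probs` would stay in shared memory rather than being
    materialized in global memory; here they are plain arrays.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    # QK^T with FP32 accumulation (explicit upcast of FP16 operands)
    scores = (q.astype(np.float32) @ k.astype(np.float32).T) * scale
    # Numerically stable softmax, kept in FP32
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Weighted sum over values, again accumulated in FP32
    out = probs @ v.astype(np.float32)
    # Result is cast back to FP16, as the next layer consumes FP16
    return out.astype(np.float16)
```

Keeping the softmax and accumulators in FP32 while tensors stay in FP16 is the standard recipe for loss-free mixed precision on Volta; an INT8 path would additionally quantize Q/K/V with per-tensor scales before the matrix products.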