🤖 AI Summary
To address excessive GPU memory peaks during the backward pass in deep neural network training—which constrain model scale and batch size—this paper proposes a gradient-lifetime-aware dynamic memory scheduling mechanism. Our method jointly models computational graph dependencies and tensor lifetimes to enable fine-grained, zero-copy reuse of gradient memory. By integrating CUDA Graph with PyTorch Autograd hooks, it unifies static graph analysis and runtime dynamic memory reclamation, ensuring seamless compatibility with arbitrary automatic differentiation frameworks. Evaluated on ResNet-50 and ViT-L, our approach reduces backward-pass memory consumption by 47%–63%, improves training throughput by 1.8×, and enables doubling the batch size without gradient checkpointing.
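The core reuse idea can be illustrated with a minimal, framework-agnostic sketch (all names here are hypothetical, not from the paper): if two gradient tensors' lifetimes do not overlap in the backward pass, they can occupy the same buffer, in the style of linear-scan register allocation. The paper's actual mechanism additionally integrates CUDA Graph capture and Autograd hooks, which this sketch omits.

```python
import heapq

def assign_buffers(lifetimes):
    """Greedy lifetime-based buffer reuse (linear-scan style sketch).

    lifetimes: iterable of (name, start_step, last_use_step).
    Returns {name: buffer_id}; tensors with disjoint lifetimes
    share a buffer id, so max(buffer_id) + 1 is the peak buffer count.
    """
    free = []            # buffer ids available for reuse
    active = []          # min-heap of (last_use_step, buffer_id)
    next_id = 0
    assignment = {}
    for name, start, end in sorted(lifetimes, key=lambda t: t[1]):
        # Reclaim buffers whose occupant was last used before this allocation.
        while active and active[0][0] < start:
            _, bid = heapq.heappop(active)
            free.append(bid)
        bid = free.pop() if free else next_id
        if bid == next_id:
            next_id += 1
        assignment[name] = bid
        heapq.heappush(active, (end, bid))
    return assignment
```

For example, with lifetimes `[("g1", 0, 2), ("g2", 1, 3), ("g3", 3, 5)]`, `g3` reuses `g1`'s buffer because `g1`'s last use (step 2) precedes `g3`'s allocation (step 3), so three gradients fit in two buffers.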