🤖 AI Summary
Problem: Fine-grained, cross-platform, real-time monitoring of GPU resources, particularly peak GPU memory usage and computational utilization, lacks adequate tooling in Unix/Linux environments.
Method: This paper introduces the first lightweight, dependency-free Python tool leveraging the NVIDIA Management Library (NVML) API. It employs multithreading and process-hooking techniques to enable low-overhead (average 0.3%) background sampling and precise peak capture of CPU/GPU utilization and system/GPU memory consumption.
Contribution/Results: The tool unifies analysis across desktop and HPC environments with high accuracy (GPU memory peak error <2%). It enables job-level GPU resource profiling—the first such capability for fine-grained, runtime GPU characterization in HPC settings—thereby addressing a critical gap in production-grade GPU observability. The implementation is open-source and has been integrated into multiple scientific computing pipelines.
📝 Abstract
Determining the maximum usage of random-access memory (RAM), both on the motherboard and on a graphics processing unit (GPU), over the lifetime of a computing task can be extremely useful for troubleshooting points of failure as well as for optimizing memory utilization, especially in a high-performance computing (HPC) setting. While tools exist for tracking compute time and RAM, including the job management tools themselves, to our knowledge there is currently no sufficient solution for tracking GPU usage. We present gpu_tracker, a Python package that tracks the computational resource usage of a task while running in the background, including the real compute time the task takes to complete, its maximum RAM usage, and its maximum GPU RAM usage, specifically for Nvidia GPUs. We demonstrate that gpu_tracker can seamlessly track computational resource usage with minimal overhead, both in desktop and HPC execution environments.
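The background-sampling approach described above can be sketched with the Python standard library alone. This is a minimal illustration of the pattern (a polling thread that records a peak value while the monitored task runs), not gpu_tracker's actual implementation; the `fake_gpu_memory` function is a hypothetical stand-in for a real NVML memory query.

```python
import threading
import time
from itertools import count

class PeakSampler:
    """Poll a sampler function from a background thread at a fixed
    interval and record the peak value observed."""

    def __init__(self, sample_fn, interval=0.01):
        self.sample_fn = sample_fn  # in gpu_tracker's setting, an NVML memory query
        self.interval = interval
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Sample, keep the running maximum, then sleep until the next tick
        # (or until the tracker is stopped, whichever comes first).
        while not self._stop.is_set():
            self.peak = max(self.peak, self.sample_fn())
            self._stop.wait(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

# Hypothetical stand-in for an NVML query: replays a fixed memory curve (MiB),
# then holds at the final reading.
readings = [100, 400, 900, 300]
calls = count()
def fake_gpu_memory():
    return readings[min(next(calls), len(readings) - 1)]

with PeakSampler(fake_gpu_memory, interval=0.01) as sampler:
    time.sleep(0.3)  # stand-in for the monitored task running to completion

# sampler.peak now holds the high-water mark of the curve (900 here)
```

Sleeping via `Event.wait` rather than `time.sleep` lets the sampler wake up immediately when the task finishes, so the tracker adds at most one polling interval of latency while keeping per-sample overhead low.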