Profiling and optimization of multi-card GPU machine learning jobs

📅 2025-05-28
🤖 AI Summary
This study addresses performance bottlenecks in large language model (LLM) fine-tuning on multi-GPU clusters, systematically analyzing three key challenges: iteration latency, VRAM utilization inefficiency, and cross-device memory transfer overhead. We propose a synergistic optimization framework integrating distributed data parallelism, hardware-aware scheduling, dynamic quantization-aware training (QAT), low-rank adaptation (LoRA/QLoRA), and direct preference optimization (DPO). Notably, this is the first unified empirical evaluation on NVIDIA H100 GPUs quantifying the efficiency boundaries of DPO, LoRA, QLoRA, and QAT under realistic fine-tuning workloads. Experiments demonstrate up to 47% reduction in iteration time, 39% decrease in peak VRAM consumption, and 52% reduction in inter-GPU memory traffic versus baseline configurations. Our core contribution is a principled, H100-targeted performance modeling and optimization methodology for LLM fine-tuning, which rigorously characterizes the applicability conditions and complementary gains of diverse parallelization strategies and precision-compression techniques.
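
To illustrate the distributed data parallelism component of the framework, here is a minimal PyTorch DDP sketch (not code from the paper; the toy model, shapes, and hyperparameters are hypothetical placeholders):

```python
# Minimal DistributedDataParallel (DDP) sketch; model and batch sizes
# are illustrative placeholders, not the paper's workload.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; in practice this would be the LLM being fine-tuned.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):  # toy training loop
        x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
```

Launched via torchrun, each process drives one GPU and gradients are all-reduced during backward(); this collective traffic is one source of the inter-GPU memory transfers the summary refers to.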

📝 Abstract
The effectiveness and efficiency of machine learning methodologies are crucial, particularly with respect to the quality of results and computational cost. This paper examines different model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition, including distributed data parallelism and distributed hardware processing, are analyzed across different hardware and software configurations. Selected optimization strategies are studied in detail, highlighting the challenges and advantages of their implementation. Furthermore, the impact of different performance improvement techniques (DPO, LoRA, QLoRA, and QAT) on the fine-tuning process of large language models is investigated. Experimental results illustrate how the nature of the task affects iteration time in a multiprocessor environment, VRAM utilization, and overall memory transfers. Test scenarios are evaluated on the modern NVIDIA H100 GPU architecture.
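
To make the LoRA/QLoRA configuration concrete, a hedged sketch using Hugging Face transformers, peft, and bitsandbytes follows; the base model name and adapter hyperparameters are illustrative assumptions, not the paper's setup:

```python
# Sketch of a QLoRA setup; model name and hyperparameters are hypothetical.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # frozen weights stored in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 (H100-friendly)
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # hypothetical base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```

Since the frozen base weights stay in 4-bit NF4 and only the low-rank adapters receive gradients, this kind of setup targets the VRAM utilization axis the abstract mentions.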
Problem

Research questions and friction points this paper is trying to address.

Optimizing multi-card GPU performance for machine learning tasks
Analyzing parallelization strategies for image recognition across varied hardware and software configurations
Investigating the impact of DPO, LoRA, QLoRA, and QAT on large language model fine-tuning (a minimal DPO-loss sketch follows this list)
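
For the DPO item above, a minimal sketch of the DPO objective computed on precomputed sequence log-probabilities (tensor names and values are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss; each argument has shape (batch,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the margin between preferred and
    # dispreferred completions relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage: random log-probs stand in for real policy/reference outputs.
policy_w = torch.randn(4, requires_grad=True)
policy_l = torch.randn(4, requires_grad=True)
loss = dpo_loss(policy_w, policy_l, torch.randn(4), torch.randn(4))
loss.backward()  # would propagate into the policy model in real training
```
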
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallelization strategies for image recognition
Optimization techniques for fine-tuning large language models
Performance evaluation on the NVIDIA H100 GPU architecture (see the profiling sketch after this list)
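
For the H100 evaluation item above, a hedged sketch of how a single training iteration can be profiled with torch.profiler (the model and input are placeholders, not the paper's benchmark workload):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(32, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,   # track allocations, relevant to peak VRAM
    record_shapes=True,
) as prof:
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()

# Per-operator breakdown of iteration time and memory usage.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
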
Marcin Lawenda
Data Processing Technologies Division, Poznan Supercomputing and Networking Center, Poznań, 61-139, Poland
Kyrylo Khloponin
Data Processing Technologies Division, Poznan Supercomputing and Networking Center, Poznań, 61-139, Poland
Krzesimir Samborski
Data Processing Technologies Division, Poznan Supercomputing and Networking Center, Poznań, 61-139, Poland
Lukasz Szustak
PhD, Assistant Professor, Czestochowa University of Technology