Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning

📅 2025-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Evaluating network hardware behavior under large-scale ML training workloads is currently costly and low-fidelity, largely because realistic tests depend on expensive GPU-based testbeds. Method: This paper introduces Genie, a testing framework that generates realistic ML communication traffic on CPUs alone, with no GPUs required, and uses that traffic to drive physical network devices on a hardware testbed; it also adapts ASTRA-sim to co-simulate network microarchitecture and ML workloads. Contribution/Results: Genie establishes a “CPU-driven + hardware-measured + simulation-enhanced” paradigm for network–ML co-evaluation, coupling hardware-level network behavior with distributed training communication patterns at high fidelity, for the first time without GPUs. Across representative training scenarios, it achieves over 90% network performance prediction accuracy while reducing testing costs by 10×, which shortens network architecture design and validation cycles.

📝 Abstract

This paper lays the foundation for Genie, a testing framework that captures the impact of real hardware network behavior on ML workload performance without requiring expensive GPUs. Genie uses CPU-initiated traffic over a hardware testbed to emulate GPU-to-GPU communication, and adapts the ASTRA-sim simulator to model the interaction between the network and the ML workload.
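
To make the idea of CPU-initiated traffic standing in for GPU-to-GPU communication more concrete, the sketch below computes the message schedule of a ring all-reduce, a collective commonly used in distributed training. This is an illustration under assumptions, not Genie's implementation: the abstract does not say which collectives Genie emulates or how it issues traffic, and the names `Transfer` and `ring_allreduce_schedule` are hypothetical. A CPU host could replay such a schedule over the testbed network in place of a GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One point-to-point message in the emulated collective (hypothetical type)."""
    step: int    # communication step index
    src: int     # sending rank (a CPU host standing in for a GPU)
    dst: int     # receiving rank
    chunk: int   # index of the data chunk being sent
    nbytes: int  # payload size in bytes


def ring_allreduce_schedule(num_ranks: int, message_bytes: int) -> list[Transfer]:
    """Return the transfers of a standard ring all-reduce.

    Assumption: ring all-reduce is used only as a familiar example of
    training traffic; the source does not state Genie's traffic patterns.
    """
    if num_ranks < 2:
        raise ValueError("ring all-reduce needs at least two ranks")

    # The buffer is split into num_ranks chunks (ceiling division).
    chunk_bytes = -(-message_bytes // num_ranks)

    transfers = []
    # Steps 0..N-2 form the reduce-scatter phase, steps N-1..2N-3 the all-gather.
    for step in range(2 * (num_ranks - 1)):
        for src in range(num_ranks):
            dst = (src + 1) % num_ranks       # each rank sends to its ring neighbor
            chunk = (src - step) % num_ranks  # chunk index rotates backwards each step
            transfers.append(Transfer(step, src, dst, chunk, chunk_bytes))
    return transfers


if __name__ == "__main__":
    schedule = ring_allreduce_schedule(num_ranks=4, message_bytes=4000)
    sent_by_rank0 = sum(t.nbytes for t in schedule if t.src == 0)
    print(len(schedule), sent_by_rank0)
```

Each rank sends 2(N-1)/N times the buffer size in total, so the schedule depends only on the rank count and message size, not on any GPU computation. That property is what makes replaying the traffic from CPUs plausible in principle.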

Problem

Research questions and friction points this paper is trying to address.

Testing network impact on ML performance without GPUs

Emulating GPU communication using CPU traffic

Modeling network-ML workload interaction via simulator

Innovation

Methods, ideas, or system contributions that make the work stand out.

CPU-initiated traffic emulates GPU communication

Hardware testbed for realistic network behavior

ASTRA-sim models network-workload interaction
