🤖 AI Summary
Evaluating how network hardware behaves under large-scale ML training workloads is costly and often low-fidelity, largely because it depends on expensive GPU-based testbeds. Method: This paper introduces Genie, a testing framework that generates realistic ML communication traffic from CPUs alone, with no GPUs required, and uses that traffic to drive physical network devices on a hardware testbed. It also extends ASTRA-sim to co-simulate network microarchitectures and ML workloads. Contribution/Results: Genie establishes the first “CPU-driven + hardware-measured + simulation-enhanced” paradigm for network–ML co-evaluation, coupling hardware-level network behavior with distributed training communication patterns at high fidelity without GPUs. Across representative training scenarios, it achieves over 90% network performance prediction accuracy while reducing testing costs by 10×, which accelerates network architecture design and validation cycles.
📝 Abstract
This paper lays the foundation for Genie, a testing framework that captures the impact of real hardware network behavior on ML workload performance without requiring expensive GPUs. Genie uses CPU-initiated traffic over a hardware testbed to emulate GPU-to-GPU communication, and adapts the ASTRA-sim simulator to model the interaction between the network and the ML workload.
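
To make the idea of CPU-initiated traffic standing in for GPU-to-GPU communication more concrete, the sketch below lists the point-to-point transfers of a ring all-reduce, a collective commonly used in distributed training. It is an illustration under stated assumptions, not Genie's implementation: the abstract does not say which collectives Genie replays or how, and the names `Transfer` and `ring_allreduce_schedule` are hypothetical. A CPU-based traffic generator could, in principle, replay a schedule like this between hosts on a testbed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One point-to-point message in the schedule (hypothetical type)."""
    step: int        # position in the collective's schedule
    src: int         # rank of the sending host
    dst: int         # rank of the receiving host
    num_bytes: int   # payload size of this message


def ring_allreduce_schedule(num_ranks: int, message_bytes: int) -> list[Transfer]:
    """List the transfers of a ring all-reduce over `num_ranks` hosts.

    A ring all-reduce runs 2 * (N - 1) steps: N - 1 steps of reduce-scatter
    followed by N - 1 steps of all-gather. In every step, each rank sends
    one chunk of roughly message_bytes / N to its right-hand neighbour.
    """
    if num_ranks < 2:
        return []  # a single rank has nothing to exchange
    chunk = -(-message_bytes // num_ranks)  # ceiling division
    schedule = []
    for step in range(2 * (num_ranks - 1)):
        for src in range(num_ranks):
            dst = (src + 1) % num_ranks  # right-hand neighbour on the ring
            schedule.append(Transfer(step, src, dst, chunk))
    return schedule


if __name__ == "__main__":
    plan = ring_allreduce_schedule(num_ranks=4, message_bytes=1 << 20)
    total = sum(t.num_bytes for t in plan)
    print(f"{len(plan)} transfers, {total} bytes on the wire")
```

The schedule captures only who sends how much to whom at each step. Timing, transport choice, and overlap with compute are left out here; those are the aspects a hardware testbed and a simulator such as ASTRA-sim would be needed to capture.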