🤖 AI Summary
Accurately modeling performance for large language model (LLM) training and inference on GPU clusters remains challenging due to complex hardware-software interactions, dynamic memory behavior, and emerging resilience requirements. Method: This paper introduces the first closed-loop simulation framework integrating the DeepFlow frontend with an extended Astra-Sim backend. It enables operator-level, hardware-aware execution trace generation, tile-grained latency modeling, activation-lifecycle-driven pruning of memory-infeasible configurations, and, uniquely, quantitative evaluation of resilience scenarios including soft link failures and HBM bandwidth degradation. It further proposes congestion-aware routing and traversal of the hybrid-parallelism configuration space for multi-topology communication-load and fault-sensitivity analysis. Results: Evaluated on A100 clusters, the framework achieves ≤10.4% prediction error for Llama inference step latency and GPT-scale training time per batch, and ≤8% error versus ns-3 packet-level simulation for communication load, enabling millisecond-scale full-configuration sweeps and resilience assessment.
📝 Abstract
RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an abstract LLM specification (model shape, batch/sequence settings, training vs. inference, and hybrid parallelism choices) with an extended Astra-Sim backend that executes those traces on explicit multi-dimensional network topologies with congestion-aware routing and support for degraded and faulty links. The frontend assigns per-operator latency using a tile-based model that accounts for SM under-utilization and multi-level memory traffic (SRAM/L2/HBM), and prunes memory-infeasible configurations using an activation-liveness traversal under recomputation, parallelism, and ZeRO/FSDP sharding policies.
Across A100-based validation cases, RAPID-LLM predicts Llama inference step latency and GPT-scale training time per batch within 10.4% relative to published measurements, and matches ns-3 packet-level results within 8% on representative communication workloads. Case studies demonstrate how RAPID-LLM enables fast, exhaustive sweeps over hybrid-parallel configurations, quantifies sensitivity to soft link faults under realistic routing and congestion, and evaluates hypothetical GPU design variants including HBM bandwidth throttling effects.
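To illustrate the kind of tile-based latency model the abstract describes, below is a minimal roofline-style sketch. All function names, tile sizes, and hardware numbers here are illustrative assumptions, not RAPID-LLM's actual implementation: it estimates a GEMM operator's latency as the maximum of compute time over whole SM waves and HBM traffic time, so a last wave that under-fills the SMs shows up as under-utilization.

```python
from math import ceil

def gemm_tile_latency(m, n, k, tile_m, tile_n,
                      num_sms, sm_flops, hbm_bw, bytes_per_elem=2):
    """Roofline-style latency estimate (seconds) for an (m,k)@(k,n) GEMM,
    tiled into (tile_m, tile_n) output tiles, one tile per SM per wave."""
    tiles = ceil(m / tile_m) * ceil(n / tile_n)
    waves = ceil(tiles / num_sms)  # last wave may under-fill the SMs
    # Compute time: every wave costs one full tile's FLOPs per SM.
    compute_s = waves * (2 * tile_m * tile_n * k) / sm_flops
    # HBM traffic: each tile streams its A and B operand slabs once,
    # and the full output is written once (no inter-tile L2 reuse).
    traffic = bytes_per_elem * (tiles * (tile_m + tile_n) * k + m * n)
    memory_s = traffic / hbm_bw
    return max(compute_s, memory_s)
```

For example, with roughly A100-like numbers (108 SMs, ~2.9 TFLOP/s fp16 per SM, ~2 TB/s HBM), `gemm_tile_latency(4096, 4096, 4096, 128, 128, num_sms=108, sm_flops=2.9e12, hbm_bw=2.0e12)` is memory-bound under this pessimistic no-reuse traffic assumption; a fuller model would also credit SRAM/L2 reuse across tiles, as the multi-level traffic accounting in the abstract implies.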