RAPID-LLM: Resilience-Aware Performance analysis of Infrastructure for Distributed LLM Training and Inference

📅 2025-12-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Accurately modeling performance for large language model (LLM) training and inference on GPU clusters remains challenging due to complex hardware-software interactions, dynamic memory behaviors, and emerging resilience requirements. Method: This paper introduces the first closed-loop simulation framework integrating the DeepFlow frontend with an extended Astra-Sim backend. It enables operator-level hardware-aware execution trace generation, tile-grained latency modeling, activation-lifecycle-driven memory feasibility pruning, and—novelly—quantitative evaluation of resilience scenarios including soft link failures and HBM bandwidth degradation. It further proposes congestion-aware routing and hybrid parallelism configuration-space traversal for multi-topology communication load and fault-sensitivity analysis. Results: Evaluated on A100 clusters, the framework achieves ≤10.4% prediction error for Llama inference step latency and GPT-scale training batch time, and ≤8% error versus ns-3 packet-level simulation for communication load—enabling millisecond-scale full-configuration sweep and resilience assessment.

Technology Category

Application Category

📝 Abstract
RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an abstract LLM specification (model shape, batch/sequence settings, training vs. inference, and hybrid parallelism choices) with an extended Astra-Sim backend that executes those traces on explicit multi-dimensional network topologies with congestion-aware routing and support for degraded and faulty links. The frontend assigns per-operator latency using a tile-based model that accounts for SM under-utilization and multi-level memory traffic (SRAM/ L2/ HBM), and prunes memory-infeasible configurations using an activation-liveness traversal under recomputation, parallelism and ZeRO/FDSP sharding policies. Across A100-based validation cases, RAPID-LLM predicts Llama inference step latency and GPT-scale training time per batch within 10.4% relative to published measurements, and matches ns-3 packet-level results within 8% on representative communication workloads. Case studies demonstrate how RAPID-LLM enables fast, exhaustive sweeps over hybrid-parallel configurations, quantifies sensitivity to soft link faults under realistic routing and congestion, and evaluates hypothetical GPU design variants including HBM bandwidth throttling effects.
Problem

Research questions and friction points this paper is trying to address.

Models performance of distributed LLM training/inference on GPU clusters
Predicts latency and training time under various configurations and faults
Enables exhaustive analysis of hybrid-parallel setups and hardware variants
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified performance modeling framework for LLM training and inference
Couples DeepFlow frontend with extended Astra-Sim backend for execution
Uses tile-based latency model with congestion-aware routing and fault support
🔎 Similar Papers
No similar papers found.