🤖 AI Summary
The absence of a unified standard for describing distributed machine learning workloads hinders their observability, reproducibility, and hardware-software co-optimization. This work proposes the Chakra Execution Trace (ET), the first standardized, interoperable graph-based format tailored for distributed AI systems, which captures key operations (compute, memory, and communication), their data and control dependencies, timing, and resource constraints. An accompanying toolchain supports trace collection, analysis, synthesis, and replay, enabling cross-platform performance benchmarking and co-design. Chakra ETs collected on production AI clusters are analyzed in real-world case studies. Chakra has been adopted by MLCommons and is being developed collaboratively by industry organizations including NVIDIA, AMD, and Meta.
📝 Abstract
The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observing, reproducing, and optimizing the behavior of distributed machine learning (ML) workloads in production AI systems, and for enabling efficient software-hardware (SW-HW) co-design of future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called the Chakra execution trace (ET). ETs represent key operations (compute, memory, and communication), data and control dependencies, timing, and resource constraints. Chakra also includes a complementary set of tools and capabilities that enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present an analysis of Chakra ETs collected on production AI clusters and demonstrate their value through real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement from across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala.
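To make the idea of a graph-based execution trace more concrete, the following is a minimal Python sketch of a trace whose nodes are compute, memory, or communication operations linked by data and control dependencies. It is an illustration only and does not reproduce the actual Chakra ET schema or tooling: the names `NodeType`, `TraceNode`, `replay_order`, `critical_path_us`, and the example operations and durations are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    # Hypothetical operation categories, mirroring those named in the abstract.
    COMPUTE = "compute"
    MEMORY = "memory"
    COMMUNICATION = "communication"


@dataclass
class TraceNode:
    node_id: int
    name: str
    node_type: NodeType
    duration_us: int                               # recorded timing (assumed unit)
    data_deps: list = field(default_factory=list)  # ids of nodes producing inputs
    ctrl_deps: list = field(default_factory=list)  # ids of nodes that must run first


def _deps(node):
    """All predecessors of a node, data and control, without duplicates."""
    return set(node.data_deps) | set(node.ctrl_deps)


def replay_order(nodes):
    """Return node ids in an order that respects every dependency (Kahn's algorithm)."""
    indegree = {n.node_id: 0 for n in nodes}
    children = {n.node_id: [] for n in nodes}
    for n in nodes:
        for dep in _deps(n):
            indegree[n.node_id] += 1
            children[dep].append(n.node_id)
    ready = deque(sorted(i for i, d in indegree.items() if d == 0))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for child in children[current]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("trace contains a dependency cycle")
    return order


def critical_path_us(nodes):
    """Longest dependency chain, assuming unlimited parallel resources."""
    by_id = {n.node_id: n for n in nodes}
    finish = {}
    for node_id in replay_order(nodes):
        node = by_id[node_id]
        start = max((finish[d] for d in _deps(node)), default=0)
        finish[node_id] = start + node.duration_us
    return max(finish.values(), default=0)


# Hypothetical four-node trace for one training step on a single rank.
EXAMPLE_TRACE = [
    TraceNode(0, "load_weights", NodeType.MEMORY, 50),
    TraceNode(1, "forward_matmul", NodeType.COMPUTE, 200, data_deps=[0]),
    TraceNode(2, "all_reduce_grads", NodeType.COMMUNICATION, 120, data_deps=[1]),
    TraceNode(3, "optimizer_step", NodeType.COMPUTE, 80, data_deps=[2], ctrl_deps=[1]),
]

if __name__ == "__main__":
    print(replay_order(EXAMPLE_TRACE))
    print(critical_path_us(EXAMPLE_TRACE))
```

Keeping data and control dependencies in separate fields lets a replay tool or simulator treat them differently (for example, ignoring control edges when exploring alternative schedules) while still being able to recover a valid execution order from their union.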