MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

📅 2026-05-11

🤖 AI Summary
The absence of a unified standard hinders observability, reproducibility, and hardware-software co-optimization of distributed machine learning workloads. This work proposes Chakra Execution Trace (ET), the first standardized, interoperable graph-based format tailored for distributed AI systems, which precisely captures critical operations, their dependencies, and resource constraints. An accompanying toolchain enables trace collection, analysis, synthesis, and replay, facilitating cross-platform performance benchmarking and co-design. The system has been validated on real-world AI clusters, adopted by MLCommons, and is being collaboratively developed by leading industry organizations including NVIDIA, AMD, and Meta.
📝 Abstract
The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observing, reproducing, and optimizing the behavior of distributed machine learning (ML) workloads in production AI systems, and for enabling efficient software-hardware (SW-HW) co-design of future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called the Chakra execution trace (ET). ETs represent key operations (compute, memory, and communication), data and control dependencies, timing, and resource constraints. Chakra also includes a complementary set of tools and capabilities that enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present an analysis of Chakra ETs collected on production AI clusters and demonstrate their value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala.
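To make the graph-based representation described in the abstract more concrete, here is a minimal Python sketch of a trace whose nodes carry an operation kind, a duration, and data and control dependencies. It is an illustration only: the `TraceNode` class, its field names, and the two helper functions are hypothetical and are not taken from the Chakra schema or toolchain.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class TraceNode:
    """Hypothetical stand-in for one operation in an execution trace."""
    node_id: int
    name: str
    kind: str          # "compute", "memory", or "communication"
    duration_us: int   # recorded duration in microseconds
    data_deps: list[int] = field(default_factory=list)
    ctrl_deps: list[int] = field(default_factory=list)


def replay_order(nodes):
    """Return node ids in an order that respects data and control dependencies."""
    graph = {n.node_id: set(n.data_deps) | set(n.ctrl_deps) for n in nodes}
    return list(TopologicalSorter(graph).static_order())


def critical_path_us(nodes):
    """Longest dependency chain: a lower bound on runtime if all
    independent operations overlapped perfectly."""
    by_id = {n.node_id: n for n in nodes}
    finish = {}
    for node_id in replay_order(nodes):
        node = by_id[node_id]
        deps = node.data_deps + node.ctrl_deps
        start = max((finish[d] for d in deps), default=0)
        finish[node_id] = start + node.duration_us
    return max(finish.values(), default=0)


# A made-up single training step; names and durations are illustrative.
trace = [
    TraceNode(0, "load_batch", "memory", 40),
    TraceNode(1, "forward", "compute", 300, data_deps=[0]),
    TraceNode(2, "backward", "compute", 600, data_deps=[1]),
    TraceNode(3, "all_reduce_grads", "communication", 250, data_deps=[2]),
    TraceNode(4, "prefetch_next_batch", "memory", 40, ctrl_deps=[0]),
    TraceNode(5, "optimizer_step", "compute", 80, data_deps=[3]),
]

if __name__ == "__main__":
    print("replay order:", replay_order(trace))
    print("critical path (us):", critical_path_us(trace))
```

A replay tool or simulator could walk the nodes in dependency order as above, while the critical-path length gives a rough sense of how much of the step is serialized by dependencies rather than by resource limits.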
Problem

Research questions and friction points this paper is trying to address.

AI/ML workload benchmarking
performance analysis
SW-HW co-design
execution trace standardization
distributed machine learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chakra Execution Trace
performance benchmarking
SW-HW co-design
distributed ML workloads
standardized trace representation
Authors
Srinivas Sridharan
Corteva Agriscience
Applied Perception, Computer Vision, Machine Learning, Computer Graphics
Andy Balogh
Keysight
Bradford M. Beckmann
AMD
Brian Coutinho
NVIDIA
Louis Feng
UC Davis
Computer Graphics, Artificial Intelligence
Sheng Fu
NVIDIA
Sanshan Gao
NVIDIA
Mehryar Garakani
Scala Computing
Taekyung Heo
NVIDIA
Computer Systems, Computer Architecture, Memory Systems
David Kanter
MLCommons
Josh Ladd
NVIDIA
Ziwei Li
Georgia Institute of Technology
Winston Liu
Keysight
Changhai Man
Georgia Institute of Technology
Dan Mihailescu
Keysight
Spandan More
AMD
Joongun Park
Georgia Institute of Technology
Ashwin Ramachandran
Meta
Vinay Ramakrishnaiah
AMD
Saeed Rashidi
Meta
Computer Architecture, Networking, Compilers, Scalable DNN Training Platforms, Memory Systems
Vijay Janapa Reddi
Harvard University
Computer Architecture, Machine Learning Systems, Autonomous Agents
Puneet Sharma
Hewlett Packard Labs, HP Labs
Computer Networks, SDN, NFV, Wireless, Mobility
Phio Tian
NVIDIA
William Won
School of Computer Science, Georgia Institute of Technology
Computer Science
Hanjiang Wu
Georgia Institute of Technology