A Survey of End-to-End Modeling for Distributed DNN Training: Workloads, Simulators, and TCO

📅 2025-06-10
🤖 AI Summary
Problem: Distributed DNN models are growing far faster than hardware is evolving, which makes it increasingly difficult to design training systems that are both efficient and sustainable. Method: This paper introduces the first three-dimensional evaluation framework, spanning workload abstraction, simulation infrastructure, and total cost of ownership (TCO)/carbon emission modeling. The framework is grounded in a systematic literature review, multi-dimensional comparative modeling, and quantitative analysis. Contribution/Results: We identify common limitations of existing simulators in workload characterization, resource modeling, and environmental impact assessment; propose a structured capability comparison matrix that clarifies the assumptions, functional boundaries, and applicability of each tool; and expose cross-layer modeling gaps, from which we distill key open research challenges. The framework provides a reproducible, extensible benchmark and a decision-support foundation for co-designing efficient, low-carbon distributed training systems.

📝 Abstract
Distributed deep neural networks (DNNs) have become a cornerstone for scaling machine learning to meet the demands of increasingly complex applications. However, the rapid growth in model complexity far outpaces CMOS technology scaling, making sustainable and efficient system design a critical challenge. Addressing this requires coordinated co-design across software, hardware, and technology layers. Due to the prohibitive cost and complexity of deploying full-scale training systems, simulators play a pivotal role in enabling this design exploration. This survey reviews the landscape of distributed DNN training simulators, focusing on three major dimensions: workload representation, simulation infrastructure, and models for total cost of ownership (TCO) including carbon emissions. It covers how workloads are abstracted and used in simulation, outlines common workload representation methods, and includes comprehensive comparison tables covering both simulation frameworks and TCO/emissions models, detailing their capabilities, assumptions, and areas of focus. In addition to synthesizing existing tools, the survey highlights emerging trends, common limitations, and open research challenges across the stack. By providing a structured overview, this work supports informed decision-making in the design and evaluation of distributed training systems.
Problem

Research questions and friction points this paper is trying to address.

Survey how distributed DNN training simulators represent workloads.
Compare simulation frameworks and TCO/emissions models.
Highlight trends and challenges in DNN system design.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-design across software, hardware, and technology layers
Simulators for distributed DNN training exploration
TCO and carbon emissions modeling in simulations
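To make the TCO and carbon emissions dimension concrete, the following is a minimal sketch of a first-order cost and emissions estimate for a single training run. It is not taken from the survey or from any tool it reviews. The class `TrainingRun`, the function names, and every parameter (per-accelerator power, PUE, grid carbon intensity, amortization period, prices) are hypothetical and chosen only for illustration. The sketch assumes energy scales linearly with accelerator count and runtime, and it ignores embodied carbon, networking, and cooling beyond a single PUE factor.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    """Hypothetical description of one distributed training job."""
    num_accelerators: int
    hours: float                  # wall-clock duration of the run
    avg_power_kw: float           # assumed average draw per accelerator, in kW
    pue: float                    # assumed data-centre power usage effectiveness


def energy_kwh(run: TrainingRun) -> float:
    """Facility-level energy: IT energy scaled by PUE."""
    it_energy = run.num_accelerators * run.avg_power_kw * run.hours
    return it_energy * run.pue


def operational_carbon_kg(run: TrainingRun, grid_kg_co2e_per_kwh: float) -> float:
    """Operational emissions only; embodied carbon is not modeled here."""
    return energy_kwh(run) * grid_kg_co2e_per_kwh


def tco_usd(run: TrainingRun,
            capex_usd_per_accelerator: float,
            amortization_hours: float,
            usd_per_kwh: float) -> float:
    """Amortized hardware cost for the run's duration plus its energy cost."""
    capex_share = (run.num_accelerators * capex_usd_per_accelerator
                   * run.hours / amortization_hours)
    opex = energy_kwh(run) * usd_per_kwh
    return capex_share + opex


if __name__ == "__main__":
    # All numbers below are made-up inputs, not measurements.
    run = TrainingRun(num_accelerators=1000, hours=100.0,
                      avg_power_kw=0.5, pue=1.2)
    print("energy (kWh):", energy_kwh(run))
    print("carbon (kg CO2e):", operational_carbon_kg(run, 0.4))
    print("TCO (USD):", tco_usd(run, 20000.0, 40000.0, 0.1))
```

A simulator-backed model would replace the fixed `hours` and `avg_power_kw` inputs with values predicted from a workload trace and a hardware model, which is where the workload and infrastructure dimensions feed into cost and emissions.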
Jonas Svedas
imec, 20 Station Road, Cambridge CB1 2JD, UK
Hannah Watson
imec, 20 Station Road, Cambridge CB1 2JD, UK
Nathan Laubeuf
imec, Kapeldreef 75, 3001 Leuven, Belgium
Diksha Moolchandani
Researcher, IMEC Belgium
computer architecture, hardware design, CNN accelerators
Abubakr Nada
imec, Kapeldreef 75, 3001 Leuven, Belgium
Arjun Singh
imec, 20 Station Road, Cambridge CB1 2JD, UK
Dwaipayan Biswas
Program Director - XTCO Memory, IMEC Belgium
System Technology Co-optimisation, DTCO, VLSI digital design
James Myers
imec, 20 Station Road, Cambridge CB1 2JD, UK
Debjyoti Bhattacharjee
Compute System Architecture, imec
Computer Architecture, HPC, Machine Learning, EDA