DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video (T2V) serving systems deploy monolithic models, ignoring the heterogeneous resource requirements and dynamic parallelism of the language encoder, DiT, and VAE modules, which leads to low GPU utilization and high request latency. To address this, the authors propose a dynamic resource scheduling framework that coordinates across and within pipeline stages. The approach introduces, for the first time, a decoupled parallelism control mechanism and a per-step greedy scheduling algorithm that jointly optimize DiT/VAE load balancing and request starvation time. Through performance-model-driven parallelism decisions, fine-grained (per-step) resource scaling, and allocation aware of each module's distinct characteristics, the framework significantly improves resource efficiency. Evaluated on mainstream T2V models including OpenSora, it outperforms state-of-the-art serving systems by up to 1.44x in p99 latency and 1.43x in mean latency.

📝 Abstract
Text-to-Video (T2V) models aim to generate dynamic and expressive videos from textual prompts. The generation pipeline typically involves multiple modules, such as a language encoder, a Diffusion Transformer (DiT), and a Variational Autoencoder (VAE). Existing serving systems often rely on monolithic model deployment while overlooking the distinct characteristics of each module, leading to inefficient GPU utilization. In addition, DiT exhibits varying performance gains across different resolutions and degrees of parallelism, and significant optimization potential remains unexplored. To address these problems, we present DDiT, a flexible system that integrates both inter-phase and intra-phase optimizations. DDiT focuses on two key metrics: the optimal degree of parallelism, which prevents excessive parallelism at specific resolutions, and starvation time, which quantifies the sacrifice of each request. To this end, DDiT introduces a decoupled control mechanism to minimize the computational inefficiency caused by imbalances in the degree of parallelism between the DiT and VAE phases. It also designs a greedy resource allocation algorithm with a novel scheduling mechanism that operates at single-step granularity, enabling dynamic and timely resource scaling. Our evaluation on the T5 encoder, OpenSora STDiT, and OpenSora VAE models across diverse datasets reveals that DDiT significantly outperforms state-of-the-art baselines by up to 1.44x in p99 latency and 1.43x in average latency.
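The case for decoupling the DiT and VAE phases can be illustrated with a back-of-envelope GPU-seconds calculation. All timings and scaling curves below are made-up assumptions for illustration, not measurements from the paper; the point is only that a monolithic deployment forces both phases to run at the same degree of parallelism (DoP), even though the VAE phase typically saturates at a much lower DoP than DiT:

```python
# Illustrative only: DiT scales near-linearly with DoP, while the VAE
# saturates early. Per-request phase latencies (seconds) are assumptions.
def dit_time(dop):
    return 80.0 / dop            # near-linear scaling with DoP

def vae_time(dop):
    return 10.0 / min(dop, 2)    # no speedup beyond DoP = 2

def gpu_seconds(dop_dit, dop_vae):
    """Total GPU-time consumed by one request across both phases."""
    return dop_dit * dit_time(dop_dit) + dop_vae * vae_time(dop_vae)

coupled = gpu_seconds(8, 8)    # monolithic: VAE forced to DiT's DoP of 8
decoupled = gpu_seconds(8, 2)  # decoupled: VAE scaled down independently
```

Under these assumed numbers, the coupled deployment spends 120 GPU-seconds per request versus 90 when the VAE runs at its own saturation point, with identical end-to-end latency for that request; the freed GPUs can serve other requests instead.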
Problem

Research questions and friction points this paper is trying to address.

Optimize GPU utilization in T2V model serving
Address varying DiT performance across resolutions
Reduce computational inefficiency in parallelism imbalance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled control mechanism for parallelism imbalance
Greedy resource allocation with dynamic scaling
Single-step granularity scheduling for efficiency
Authors

Heyang Huang - University of Chinese Academy of Sciences; State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Cunchen Hu - University of Chinese Academy of Sciences; State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Jiaqi Zhu - University of Chinese Academy of Sciences; State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Ziyuan Gao - University of Chinese Academy of Sciences; State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Liangliang Xu - Institute of Mathematics and Interdisciplinary Sciences, Xidian University
Yizhou Shan - Huawei Cloud
Yungang Bao - Institute of Computing Technology (ICT), CAS
Ninghui Sun - University of Chinese Academy of Sciences; State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Tianwei Zhang - Nanyang Technological University
Sa Wang - Institute of Computing Technology, CAS