🤖 AI Summary
Efficient training of video diffusion Transformers under constrained data and compute budgets remains challenging. Method: We conduct systematic empirical analysis, nonlinear regression modeling, and multi-objective optimization under compute constraints to uncover scalable patterns in model behavior. We identify the high sensitivity of validation loss to learning rate and batch size, and derive the first unified scaling law that jointly predicts the optimal model size and hyperparameters for a given compute budget. Contribution/Results: Our scaling formula accurately predicts validation loss across arbitrary scales, enabling controllable performance–cost trade-offs. Under a 10¹⁰ TFLOPs budget, it reduces inference cost by 40.1% while preserving generation quality. This work establishes a reproducible, generalizable theoretical and practical framework for scaling video generative models.
📝 Abstract
Achieving optimal performance of video diffusion transformers within a given data and compute budget is crucial due to their high training costs, which necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are routinely used in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we find that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters that are often not precisely modeled. To address this, we propose a new scaling law that predicts the optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealing under practical inference-cost constraints, achieving a better trade-off.
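To make the idea of fitting a scaling law concrete, here is a minimal sketch of the general workflow: fit a power law relating validation loss to training compute on small runs, then extrapolate to a larger budget. The functional form `L(C) = a * C^(-b)`, the synthetic data, and all variable names are illustrative assumptions, not the paper's actual formula or measurements.

```python
import numpy as np

# Hypothetical synthetic data: validation loss measured at several small
# training-compute budgets (TFlops). Real scaling-law fits would use
# losses logged from actual training runs.
compute = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = 5.0 * compute ** -0.12  # assumed ground-truth law for the demo

# A power law L = a * C^(-b) is linear in log-log space:
#   log L = log a - b * log C,
# so ordinary least squares recovers the coefficient and exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a_hat, b_hat = np.exp(intercept), -slope

# Extrapolate the fitted law to an unseen, larger compute budget.
predicted = a_hat * (1e11) ** -b_hat
print(f"a={a_hat:.3f}, b={b_hat:.4f}, predicted L(1e11)={predicted:.4f}")
```

The paper's unified law additionally conditions on model size and hyperparameters (learning rate, batch size), which would turn this single-variable regression into a multi-variable nonlinear fit, but the log-space fitting principle is the same.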