Video Killed the Energy Budget: Characterizing the Latency and Power Regimes of Open Text-to-Video Models

📅 2025-09-23
🤖 AI Summary
Text-to-video (T2V) generation incurs high latency and energy consumption, yet its resource usage remains poorly characterized. This work presents a systematic study of runtime latency and energy consumption in open-source T2V models, combining a compute-bound analytical model with fine-grained measurements on WAN2.1-T2V. The measurements show that cost grows quadratically with the spatial and temporal dimensions and linearly with the number of denoising steps. The analysis is then extended to six diverse T2V models, whose runtime and energy profiles are compared under default settings. The results offer a benchmark reference and practical guidance for designing and deploying more sustainable generative video systems.

📝 Abstract
Recent advances in text-to-video (T2V) generation have enabled the creation of high-fidelity, temporally coherent clips from natural language prompts. Yet these systems come with significant computational costs, and their energy demands remain poorly understood. In this paper, we present a systematic study of the latency and energy consumption of state-of-the-art open-source T2V models. We first develop a compute-bound analytical model that predicts scaling laws with respect to spatial resolution, temporal length, and denoising steps. We then validate these predictions through fine-grained experiments on WAN2.1-T2V, showing quadratic growth with spatial and temporal dimensions, and linear scaling with the number of denoising steps. Finally, we extend our analysis to six diverse T2V models, comparing their runtime and energy profiles under default settings. Our results provide both a benchmark reference and practical insights for designing and deploying more sustainable generative video systems.
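The scaling behaviour described in the abstract can be illustrated with a small sketch. It is not the paper's analytical model: it assumes cost is proportional to the number of denoising steps times the square of the token count, and that the token count is proportional to height × width × frames (real models apply spatial and temporal compression that is not exactly proportional). The function name and the baseline configuration are hypothetical, chosen only for illustration.

```python
def relative_cost(height, width, frames, steps, base=(480, 832, 81, 50)):
    """Cost of a configuration relative to a baseline configuration.

    Assumed law (illustrative, not taken from the paper):
        cost ∝ steps * (height * width * frames) ** 2
    `base` is a hypothetical (height, width, frames, steps) reference point.
    """
    base_h, base_w, base_f, base_s = base
    # Ratio of token counts, assuming tokens ∝ height * width * frames.
    token_ratio = (height * width * frames) / (base_h * base_w * base_f)
    # Linear in denoising steps, quadratic in token count.
    return (steps / base_s) * token_ratio ** 2


if __name__ == "__main__":
    print(relative_cost(480, 832, 81, 100))   # doubling steps
    print(relative_cost(480, 832, 162, 50))   # doubling frames
    print(relative_cost(960, 1664, 81, 50))   # doubling height and width
```

Under these assumptions, doubling the number of steps doubles the cost, doubling the number of frames quadruples it, and doubling both height and width multiplies it by sixteen.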
Problem

Research questions and friction points this paper is trying to address.

Characterizing the latency and energy consumption of text-to-video models
Developing an analytical model that predicts scaling laws
Benchmarking runtime and energy profiles across diverse models
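Energy is the time integral of power, so an energy profile can be derived from sampled power readings. The sketch below shows one way to do that with the trapezoidal rule. The source does not describe its measurement tooling, so the sampling setup and the function names here are assumptions made for illustration.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (watts) over time (seconds) into joules.

    Uses the trapezoidal rule between consecutive samples. How the samples
    are collected is left open; this helper is hypothetical.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average power over the interval times the interval duration.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_to_wh(joules):
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600.0


if __name__ == "__main__":
    # Constant 100 W draw sampled over 2 seconds.
    print(energy_joules([0.0, 1.0, 2.0], [100.0, 100.0, 100.0]))
```

Latency is the span between the first and last timestamps, so the same samples yield both quantities for a single generation run.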
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed a compute-bound analytical model that predicts scaling laws
Validated its predictions through fine-grained experiments on WAN2.1-T2V
Extended the analysis to six diverse T2V models
🔎 Similar Papers
Julien Delavande
Hugging Face, ENS Paris-Saclay
Régis Pierrard
Hugging Face
Sasha Luccioni
Hugging Face
Machine Learning, Natural Language Processing, AI Ethics, AI for Social Good, AI for Climate Change