🤖 AI Summary
Generative text-to-audio diffusion models incur high energy costs at inference time, making it challenging to achieve high audio fidelity and energy efficiency simultaneously.
Method: We introduce the first systematic energy quantification framework for such models, empirically evaluating seven state-of-the-art architectures. We analyze nonlinear relationships between energy consumption and key inference parameters, including sampling steps and audio resolution, via sensitivity analysis, multi-objective Pareto frontier modeling, and cross-model energy-efficiency normalization.
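Cross-model energy-efficiency normalization can be illustrated with a minimal sketch: energy per run divided by seconds of audio produced, so models generating clips of different lengths are comparable. All model names and numbers below are hypothetical placeholders, not measurements from the paper.

```python
# Minimal sketch of cross-model energy-efficiency normalization.
# Assumption: efficiency is expressed as Wh consumed per second of
# generated audio; the paper's exact metric may differ.

def energy_per_audio_second(energy_wh: float, audio_seconds: float) -> float:
    """Normalized efficiency: Wh consumed per second of generated audio."""
    return energy_wh / audio_seconds

# Hypothetical per-inference measurements:
# (model name, Wh per run, seconds of audio generated)
runs = [
    ("model_a", 12.0, 10.0),
    ("model_b", 4.5, 5.0),
    ("model_c", 30.0, 10.0),
]

normalized = {name: energy_per_audio_second(e, s) for name, e, s in runs}
most_efficient = min(normalized, key=normalized.get)
```

With these placeholder numbers, `model_b` comes out most efficient at 0.9 Wh per audio second, even though `model_a` has a lower per-second runtime cost in absolute Wh per run.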
Contribution/Results: We propose a novel “Pareto energy-efficiency–quality co-optimization” paradigm for green AI. Experiments identify three low-energy, high-fidelity configurations that reduce energy consumption by up to 47% while degrading Mean Opinion Score (MOS) by less than 0.3. Our approach provides a reproducible, generalizable pathway for sustainable audio generation, enabling principled trade-offs between perceptual quality and computational sustainability.
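The co-optimization idea above can be sketched as a standard Pareto-dominance filter over (energy, MOS) pairs: keep only configurations for which no other configuration is at least as good on both axes and strictly better on one. The configuration names and scores below are hypothetical, chosen only to illustrate the selection logic.

```python
# Sketch of Pareto-optimal selection over (energy, quality) trade-offs.
# Assumption: lower energy is better, higher MOS is better; data are
# illustrative placeholders, not results from the paper.

def pareto_front(configs):
    """Return configurations not dominated by any other configuration."""
    front = []
    for c in configs:
        dominated = any(
            other["energy"] <= c["energy"]
            and other["mos"] >= c["mos"]
            and (other["energy"] < c["energy"] or other["mos"] > c["mos"])
            for other in configs
        )
        if not dominated:
            front.append(c)
    return front

# Hypothetical sampling-step configurations with measured energy (Wh)
# and Mean Opinion Score (1-5 scale).
configs = [
    {"name": "steps=100", "energy": 10.0, "mos": 4.2},
    {"name": "steps=50",  "energy": 5.3,  "mos": 4.0},
    {"name": "steps=25",  "energy": 2.8,  "mos": 3.2},
    {"name": "steps=50b", "energy": 6.0,  "mos": 3.9},  # dominated by steps=50
]

front = pareto_front(configs)
```

Here `steps=50b` is removed because `steps=50` uses less energy and scores a higher MOS; the remaining three configurations form the frontier from which low-energy, high-fidelity operating points are chosen.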
📝 Abstract
Text-to-audio models have recently emerged as a powerful technology for generating sound from textual descriptions. However, their high computational demands raise concerns about energy consumption and environmental impact. In this paper, we analyze the energy usage of seven state-of-the-art diffusion-based text-to-audio generative models, evaluating the extent to which variations in generation parameters affect energy consumption at inference time. We also aim to identify an optimal balance between audio quality and energy consumption by considering Pareto-optimal solutions across all selected models. Our findings provide insights into the trade-offs between performance and environmental impact, contributing to the development of more efficient generative audio models.