Post-Training Quantization for Audio Diffusion Transformers

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and deployment challenges of audio diffusion Transformers (DiTs), this work proposes a co-optimization framework for low-bit quantization. First, we design a denoising-step-aware activation smoothing strategy that dynamically calibrates quantization scales to suppress outliers. Second, we introduce a lightweight LoRA branch initialized via singular value decomposition (SVD) to compensate for weight reconstruction errors induced by W4A8/W8A8 quantization. Our method integrates static and dynamic quantization paradigms, enabling precision adaptation that is aware of both input channels and diffusion timesteps. Experiments on Stable Audio Open demonstrate that our approach reduces memory footprint by 79%, preserves high-fidelity audio generation under W4A8 quantization, significantly improves perceptual quality at low bitwidths via dynamic quantization, and further lowers inference latency through static quantization.
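To make the smoothing idea concrete, here is a minimal PyTorch sketch of timestep-aware, SmoothQuant-style scale calibration. It assumes per-input-channel activation abs-max statistics have been collected for each denoising-timestep bucket during calibration; the function names, the alpha migration factor, and the bucketing are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def calibrate_smoothing_scales(act_absmax_per_t, weight, alpha=0.5, eps=1e-8):
    """Per-input-channel smoothing scales, computed separately per timestep bucket.

    act_absmax_per_t: dict {timestep_bucket: tensor [in_features]} of activation
                      abs-max values gathered on calibration prompts (assumed input).
    weight:           linear weight tensor [out_features, in_features].
    Returns a dict {timestep_bucket: scale tensor [in_features]}.
    """
    # Per-input-channel weight magnitude used to balance the migration.
    w_absmax = weight.abs().amax(dim=0).clamp_min(eps)
    scales = {}
    for t, a_absmax in act_absmax_per_t.items():
        # SmoothQuant-style balance: s_j = max|x_j|^alpha / max|w_j|^(1-alpha),
        # recomputed here for each denoising-timestep bucket.
        scales[t] = (a_absmax.clamp_min(eps) ** alpha) / (w_absmax ** (1.0 - alpha))
    return scales

def smooth_activations(x, scale_t):
    # Divide activations by the timestep-specific scale; the corresponding
    # weights are multiplied by the same scale offline so the layer output is unchanged.
    return x / scale_t
```

Dividing activations by a timestep-specific scale while folding the inverse scale into the weights leaves the matmul result unchanged but shifts outlier magnitude from activations into the easier-to-quantize weights.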

📝 Abstract
Diffusion Transformers (DiTs) enable high-quality audio synthesis but are often computationally intensive and require substantial storage, which limits their practical deployment. In this paper, we present a comprehensive evaluation of post-training quantization (PTQ) techniques for audio DiTs, analyzing the trade-offs between static and dynamic quantization schemes. We explore two practical extensions: (1) a denoising-timestep-aware smoothing method that adapts quantization scales per input channel and per timestep to mitigate activation outliers, and (2) a lightweight low-rank adapter (LoRA)-based branch derived from singular value decomposition (SVD) to compensate for residual weight errors. Using Stable Audio Open, we benchmark W8A8 and W4A8 configurations across objective metrics and human perceptual ratings. Our results show that dynamic quantization preserves fidelity even at lower precision, while static methods remain competitive with lower latency. Overall, our findings show that low-precision DiTs can retain high-fidelity generation while reducing memory usage by up to 79%.
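As a reading aid, the SVD-initialized LoRA compensation can be sketched as a truncated SVD of the weight quantization residual. The function name, the rank, and the square-root split of singular values below are assumptions for illustration and need not match the paper's exact initialization.

```python
import torch

def svd_lora_from_quant_error(w_fp, w_deq, rank=16):
    """Initialize a low-rank branch from the SVD of the weight quantization residual.

    w_fp:  full-precision weight [out_features, in_features].
    w_deq: its dequantized W4/W8 counterpart (same shape).
    Returns (A, B) with A [rank, in], B [out, rank] so that w_deq + B @ A ~ w_fp.
    """
    err = (w_fp - w_deq).float()
    U, S, Vh = torch.linalg.svd(err, full_matrices=False)
    # Split the top singular values evenly between the two factors.
    B = U[:, :rank] * S[:rank].sqrt()
    A = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]
    return A, B

# At inference the layer would compute: y = x @ w_deq.T + (x @ A.T) @ B.T
```

Because the branch is initialized from the residual w_fp - w_deq rather than from zero, it directly targets the reconstruction error introduced by W4A8/W8A8 weight quantization.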
Problem

Research questions and friction points this paper is trying to address.

Reducing the computational cost and storage footprint of audio diffusion Transformers
Evaluating the trade-offs between static and dynamic quantization schemes (see the sketch after this list)
Maintaining audio generation fidelity while reducing memory usage
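For readers unfamiliar with the static/dynamic distinction evaluated here, a minimal simulated (fake-quant) INT8 activation example is sketched below; the helper names and per-token dynamic scaling are illustrative assumptions, not the paper's implementation.

```python
import torch

def fake_quant_int8(x, scale):
    # Simulated symmetric INT8 quantization: round, clamp, then dequantize.
    return torch.clamp(torch.round(x / scale), -128, 127) * scale

def static_act_quant(x, calib_scale):
    # Static scheme: the scale is fixed offline from calibration data,
    # so inference only pays for the quantize/dequantize itself.
    return fake_quant_int8(x, calib_scale)

def dynamic_act_quant(x):
    # Dynamic scheme: a per-token scale is computed from the live activation
    # range, which tracks outliers better at some extra runtime cost.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    return fake_quant_int8(x, scale)
```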
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training quantization reduces DiT memory usage
Timestep-aware smoothing mitigates activation outliers
SVD-derived LoRA branch compensates for weight errors