🤖 AI Summary
Diffusion Transformers (DiTs) achieve state-of-the-art performance in non-autoregressive text-to-speech (TTS), yet their high inference cost hinders practical deployment. This paper proposes DiTReducio, a training-free, lightweight acceleration framework for DiT-based TTS. It introduces two zero-training compression mechanisms—Temporal Skipping and Branch Skipping—and integrates an attention-pattern-guided progressive calibration strategy to dynamically skip redundant timesteps and network branches during inference. Unlike distillation- or retraining-based approaches, DiTReducio imposes no additional training overhead while substantially reducing computational load. Evaluated on F5-TTS and MegaTTS 3, it achieves a 75.4% reduction in FLOPs and a 37.1% improvement in real-time factor (RTF) with no degradation in speech naturalness or audio fidelity. The key contribution is a training-free, dynamic, architecture-aware inference compression framework for DiTs that balances efficiency gains against high-fidelity generation quality.
📝 Abstract
While Diffusion Transformers (DiT) have advanced non-autoregressive (NAR) speech synthesis, their high computational demands remain a limitation. Existing acceleration approaches for DiT-based text-to-speech (TTS) models mainly focus on reducing sampling steps through distillation, yet they remain constrained by training costs. We introduce DiTReducio, a training-free acceleration framework that compresses computation in DiT-based TTS models via progressive calibration. We propose two compression methods, Temporal Skipping and Branch Skipping, to eliminate redundant computation during inference. Moreover, based on two characteristic attention patterns identified within DiT layers, we devise a pattern-guided strategy to selectively apply the compression methods. Our method allows flexible trade-offs between generation quality and computational efficiency through adjustable compression thresholds. Experiments on F5-TTS and MegaTTS 3 demonstrate that DiTReducio achieves a 75.4% reduction in FLOPs and improves the Real-Time Factor (RTF) by 37.1%, while preserving generation quality.
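To make the two mechanisms concrete, here is a minimal NumPy sketch of the general idea behind threshold-based skipping. This is an illustrative toy, not the paper's implementation: the block, thresholds, similarity test, and the `skip_ffn` flag (standing in for a decision produced by an offline calibration pass) are all assumptions for demonstration.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two flattened activation vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class ToyDiTBlock:
    """Toy transformer block with two training-free skips (illustrative only)."""

    def __init__(self, dim, temporal_thresh=0.999, skip_ffn=False, seed=0):
        rng = np.random.default_rng(seed)
        self.w_attn = rng.standard_normal((dim, dim)) * 0.05
        self.w_ffn = rng.standard_normal((dim, dim)) * 0.05
        self.temporal_thresh = temporal_thresh
        self.skip_ffn = skip_ffn   # Branch Skipping: assumed to be set by a calibration pass
        self._prev_in = None       # cached input/output for Temporal Skipping
        self._prev_out = None
        self.n_skipped = 0

    def forward(self, x):
        # Temporal Skipping: adjacent diffusion timesteps often see
        # near-identical activations, so reuse the cached output.
        if self._prev_in is not None and cos_sim(x, self._prev_in) > self.temporal_thresh:
            self.n_skipped += 1
            return self._prev_out
        h = x + np.tanh(x @ self.w_attn)      # attention branch (toy stand-in)
        if not self.skip_ffn:                 # Branch Skipping: drop this branch entirely
            h = h + np.tanh(h @ self.w_ffn)
        self._prev_in, self._prev_out = x, h
        return h

# Simulate a denoising trajectory whose inputs drift slowly across timesteps.
block = ToyDiTBlock(dim=16)
x0 = np.random.default_rng(1).standard_normal(16)
for t in range(10):
    y = block.forward(x0 + 1e-4 * t)          # tiny change per timestep
print(f"skipped {block.n_skipped} of 10 timesteps")
```

Because consecutive inputs differ only slightly, all steps after the first fall above the similarity threshold and reuse the cached output; tightening `temporal_thresh` toward 1 trades those savings back for exact computation, mirroring the adjustable quality/efficiency trade-off described in the abstract.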