🤖 AI Summary
Diffusion Transformers (DiTs) achieve state-of-the-art performance in non-autoregressive text-to-speech (TTS), yet their high inference cost hinders practical deployment. This paper proposes DiTReducio, a training-free, lightweight acceleration framework for DiT-based TTS. It introduces two zero-training compression mechanisms—Temporal Skipping and Branch Skipping—and integrates an attention-pattern-guided progressive calibration strategy to dynamically skip redundant timesteps and network branches during inference. Unlike distillation- or retraining-based approaches, DiTReducio imposes no additional training overhead while substantially reducing computational load. Evaluated on F5-TTS and MegaTTS 3, it achieves a 75.4% reduction in FLOPs and a 37.1% improvement in real-time factor (RTF) with no degradation in speech naturalness or audio fidelity. The key contribution is a training-free, dynamic, architecture-aware inference compression framework for DiTs that balances efficiency gains against high-fidelity generation quality.
📝 Abstract
While Diffusion Transformers (DiT) have advanced non-autoregressive (NAR) speech synthesis, their high computational demands remain a limitation. Existing acceleration approaches for DiT-based text-to-speech (TTS) models mainly focus on reducing sampling steps through distillation, yet they remain constrained by training costs. We introduce DiTReducio, a training-free acceleration framework that compresses computation in DiT-based TTS models via progressive calibration. We propose two compression methods, Temporal Skipping and Branch Skipping, to eliminate redundant computation during inference. Moreover, based on two characteristic attention patterns identified within DiT layers, we devise a pattern-guided strategy to selectively apply the compression methods. Our method allows flexible trade-offs between generation quality and computational efficiency through adjustable compression thresholds. Experiments on F5-TTS and MegaTTS 3 demonstrate that DiTReducio achieves a 75.4% reduction in FLOPs and improves the Real-Time Factor (RTF) by 37.1%, while preserving generation quality.
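To make the two mechanisms concrete, here is a minimal NumPy sketch of the general idea behind threshold-based skipping. This is an illustrative toy, not the paper's implementation: the block, thresholds, similarity test, and the `skip_ffn` flag (standing in for a decision produced by an offline calibration pass) are all assumptions for demonstration.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two flattened activation vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class ToyDiTBlock:
    """Toy transformer block with two training-free skips (illustrative only)."""

    def __init__(self, dim, temporal_thresh=0.999, skip_ffn=False, seed=0):
        rng = np.random.default_rng(seed)
        self.w_attn = rng.standard_normal((dim, dim)) * 0.05
        self.w_ffn = rng.standard_normal((dim, dim)) * 0.05
        self.temporal_thresh = temporal_thresh
        self.skip_ffn = skip_ffn   # Branch Skipping: assumed to be set by a calibration pass
        self._prev_in = None       # cached input/output for Temporal Skipping
        self._prev_out = None
        self.n_skipped = 0

    def forward(self, x):
        # Temporal Skipping: adjacent diffusion timesteps often see
        # near-identical activations, so reuse the cached output.
        if self._prev_in is not None and cos_sim(x, self._prev_in) > self.temporal_thresh:
            self.n_skipped += 1
            return self._prev_out
        h = x + np.tanh(x @ self.w_attn)      # attention branch (toy stand-in)
        if not self.skip_ffn:                 # Branch Skipping: drop this branch entirely
            h = h + np.tanh(h @ self.w_ffn)
        self._prev_in, self._prev_out = x, h
        return h

# Simulate a denoising trajectory whose inputs drift slowly across timesteps.
block = ToyDiTBlock(dim=16)
x0 = np.random.default_rng(1).standard_normal(16)
for t in range(10):
    y = block.forward(x0 + 1e-4 * t)          # tiny change per timestep
print(f"skipped {block.n_skipped} of 10 timesteps")
```

Because consecutive inputs differ only slightly, all steps after the first fall above the similarity threshold and reuse the cached output; tightening `temporal_thresh` toward 1 trades those savings back for exact computation, mirroring the adjustable quality/efficiency trade-off described in the abstract.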