🤖 AI Summary
Existing compression methods for video diffusion Transformers commit to a fixed architecture, so they cannot allocate compute according to the input content or the denoising stage. This work proposes a framework that jointly compresses model width and depth through structure-aware pruning and input-adaptive routing. For width, it scores attention-head importance with awareness of each head's spatial or temporal role, so that motion-critical temporal heads are not pruned prematurely. For depth, a lightweight router selects which blocks to execute based on the current timestep and visual content. Trained with knowledge distillation followed by joint optimization of the student and router, the method substantially reduces per-step computational cost on Wan2.1-14B while preserving generation quality across VBench dimensions, and it composes with step distillation for further acceleration.
📝 Abstract
Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.
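To make the depth-routing idea concrete, here is a minimal, illustrative sketch of input-adaptive block selection. It is not the PARE implementation: the abstract does not specify the router architecture, so the linear-plus-sigmoid gate, the hard threshold, and the names `route_blocks` and `run_with_routing` are all assumptions made for illustration, and the blocks are toy functions rather than Transformer layers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_blocks(timestep, content_feature, weights, biases, threshold=0.5):
    """Return one boolean per block: True = execute, False = skip.

    Hypothetical router (an assumption, not the paper's design):
    a per-block linear gate over [timestep, content_feature].
    timestep is a normalized denoising timestep in [0, 1];
    content_feature is a pooled visual feature given as a list of floats.
    """
    router_input = [timestep] + list(content_feature)
    decisions = []
    for w, b in zip(weights, biases):
        logit = sum(wi * xi for wi, xi in zip(w, router_input)) + b
        decisions.append(sigmoid(logit) >= threshold)
    return decisions


def run_with_routing(hidden, blocks, decisions):
    """Apply only the selected blocks; a skipped block acts as identity."""
    executed = 0
    for block, keep in zip(blocks, decisions):
        if keep:
            hidden = block(hidden)
            executed += 1
    return hidden, executed


# Toy setup: four "blocks" that each add a constant to the hidden state.
blocks = [lambda h, k=k: [v + k for v in h] for k in range(1, 5)]
# Hand-picked gate parameters (illustrative values, not learned).
weights = [[0.0, 0.0], [4.0, 0.0], [-4.0, 0.0], [0.0, 2.0]]
biases = [1.0, -2.0, 2.0, -1.0]

late = route_blocks(0.9, [0.2], weights, biases)   # high-noise step
early = route_blocks(0.1, [0.2], weights, biases)  # low-noise step
out, n_run = run_with_routing([0.0], blocks, late)
```

With these hand-picked parameters, the same input activates a different subset of blocks at the two timesteps, which is the difference between routing and static block removal: no block is permanently deleted, and the executed depth is decided per step and per input.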