🤖 AI Summary
Existing generative models suffer from high computational overhead and low inference efficiency during image/video decoding. This paper addresses these issues via dynamic model scaling and KV cache reuse at decoding time: (1) a multi-granularity token co-generation architecture built upon a nested parameter-sharing Transformer; (2) the first decoding-stage dynamic model sizing strategy, which adaptively allocates compute resources based on token generation importance; and (3) strided cache reuse to minimize redundant computation. Evaluated on ImageNet (256×256 image generation), UCF101, and Kinetics600 (video generation and frame prediction), our method reduces computational cost by approximately 3× while maintaining generation quality comparable to full-capacity baseline models. The core contribution is the first joint dynamic optimization of model capacity and KV cache during generation—achieving a principled trade-off between inference efficiency and perceptual fidelity.
📝 Abstract
Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental bottleneck: inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs, yet the model size is kept constant across all iterations, which makes generation computationally expensive. In this work, we address this issue through two key ideas: (a) not all parts of the generation process need equal compute, so we design a decode-time model scaling schedule to allocate compute effectively; and (b) some of the computation can be cached and reused. Combining these two ideas amounts to using smaller models to process more tokens while larger models process fewer tokens. These different-sized models do not increase the parameter count, as they share parameters. We rigorously experiment on ImageNet 256$\times$256, UCF101, and Kinetics600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost $3\times$ less compute than the baseline, our model obtains competitive performance.
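To make the two ideas concrete, here is a minimal toy sketch (not the paper's implementation) of decode-time model scaling with nested parameter sharing: the "small" model is simply a top-left slice of the full model's weight matrix, so switching sizes adds no parameters, and a hypothetical schedule runs the full-width model only on the first few (presumed most important) decode steps. All names, the schedule shape, and the width fractions are illustrative assumptions.

```python
import numpy as np

def nested_forward(x, W, frac):
    """Toy parameter-shared layer run at a fraction of full width.

    The small model is a nested slice of the full weight matrix W,
    so no extra parameters are introduced (sketch of the nested
    parameter-sharing idea, not the paper's architecture).
    """
    d = int(W.shape[0] * frac)
    h = np.zeros(W.shape[0])
    h[:d] = np.tanh(x[:d] @ W[:d, :d])  # only the top-left d x d block is used
    return h

def decode_schedule(num_steps, small_frac=0.25, large_steps=4):
    """Assumed schedule: full-width model for the first few decode
    steps, the nested small model for the remaining steps."""
    return [1.0 if t < large_steps else small_frac for t in range(num_steps)]

def relative_cost(schedule):
    """Relative matmul cost vs. running full-size at every step
    (a d x d matmul costs ~ d^2, i.e. frac^2 of the full cost)."""
    return sum(f ** 2 for f in schedule) / len(schedule)

D = 64
rng = np.random.default_rng(0)
W = rng.normal(size=(D, D)) / np.sqrt(D)
x = rng.normal(size=D)

h_small = nested_forward(x, W, frac=0.25)   # uses only W[:16, :16]
sched = decode_schedule(num_steps=16)
print(f"relative compute vs full-size: {relative_cost(sched):.3f}")
```

Under these toy assumptions the schedule already lands near the paper's reported regime: 4 full-width steps plus 12 quarter-width steps cost about 0.30 of the all-full-size baseline, i.e. roughly a 3x compute reduction, before any cache reuse is counted.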