Progressive Supernet Training for Efficient Visual Autoregressive Modeling

📅 2025-11-20

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

To address the excessive memory overhead in visual autoregressive (VAR) models during multi-scale generation—caused by cumulative key-value (KV) cache accumulation—this paper proposes VARiant, a scale-depth-aware supernet framework. Its core innovation lies in identifying and modeling the asymmetric dependency between spatial scale and network depth in VAR, enabling the design of isometric subnet sampling, a weight-sharing supernet architecture, and a progressive training strategy. VARiant is the first method to enable zero-cost, runtime depth switching within a single model and to surpass the Pareto-optimal frontier under fixed training budget constraints. On ImageNet, VARiant-d16/d8 reduces memory consumption by 40–65% with only marginal FID degradation (2.05/2.12); VARiant-d2 achieves 3.5× speedup and 80% memory reduction while maintaining competitive FID (2.97), demonstrating unprecedented balance between generation quality and computational efficiency.

Technology Category

Application Category

📝 Abstract

Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize subnet. Subnet and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between subnet and the entire network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.

Problem

Research questions and friction points this paper is trying to address.

Reduce memory overhead from KV caching in visual autoregressive models

Address optimization conflicts in weight-sharing subnets during training

Enable flexible deployment with single-model architecture supporting depth switching

Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive subnet training reduces memory overhead

Weight sharing enables flexible depth adjustment

Single model supports zero-cost runtime depth switching

🔎 Similar Papers

Non-autoregressive Sequence-to-Sequence Vision-Language Models