ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual autoregressive (VAR) models suffer from prohibitive computational overhead due to quadratic sequence-length growth, while static pruning often disrupts pretrained dependencies and degrades generation quality. To address this, we propose ActVAR—a novel dual-sparse dynamic architecture that jointly optimizes expert subnetwork activation and token selection in a content-aware manner via a learnable router and a gated token selector. Furthermore, ActVAR integrates a factorized feed-forward network and a two-stage knowledge distillation scheme to dynamically sparsify both weights and token computations—without modifying the backbone architecture. Evaluated on ImageNet at 256×256 resolution, ActVAR reduces FLOPs by up to 21.2% with negligible performance degradation, significantly outperforming static pruning baselines. It achieves an effective trade-off among generation fidelity, inference efficiency, and global contextual consistency.

📝 Abstract
Visual Autoregressive (VAR) models enable efficient image generation via next-scale prediction but face escalating computational costs as sequence length grows. Existing static pruning methods degrade performance by permanently removing weights or tokens, disrupting pretrained dependencies. To address this, we propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences to enhance efficiency without sacrificing capacity. ActVAR decomposes feedforward networks (FFNs) into lightweight expert sub-networks and employs a learnable router to dynamically select token-specific expert subsets based on content. Simultaneously, a gated token selector identifies high-update-potential tokens for computation while reconstructing unselected tokens to preserve global context and sequence alignment. Training employs a two-stage knowledge distillation strategy, where the original VAR model supervises the learning of routing and gating policies to align with pretrained knowledge. Experiments on the ImageNet $256\times 256$ benchmark demonstrate that ActVAR achieves up to $21.2\%$ FLOPs reduction with minimal performance degradation.
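The factorized FFN with content-aware expert routing described above can be sketched as follows. This is a minimal NumPy illustration of the general idea (a full FFN split into expert sub-networks, with a learned router activating a top-k subset per token); all weights, dimensions, the ReLU expert form, and the top-k mixing are illustrative assumptions, not the paper's actual design or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, n_experts, top_k = 8, 32, 4, 2
d_expert = d_ff // n_experts  # factorized FFN: each expert owns a slice of the hidden width

# Hypothetical weights: the full FFN is split into per-expert sub-networks.
W1 = rng.standard_normal((n_experts, d_model, d_expert)) * 0.1
W2 = rng.standard_normal((n_experts, d_expert, d_model)) * 0.1
W_router = rng.standard_normal((d_model, n_experts)) * 0.1

def routed_ffn(x):
    """x: (n_tokens, d_model). Each token activates only its top-k experts."""
    logits = x @ W_router                          # (n_tokens, n_experts) router scores
    topk = np.argsort(-logits, axis=1)[:, :top_k]  # content-aware expert choice per token
    # Softmax over the selected experts' logits gives the mixing weights.
    sel = np.take_along_axis(logits, topk, axis=1)
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, e in enumerate(topk[t]):
            h = np.maximum(x[t] @ W1[e], 0.0)      # expert sub-network (ReLU FFN slice)
            out[t] += w[t, j] * (h @ W2[e])
    return out

tokens = rng.standard_normal((6, d_model))
y = routed_ffn(tokens)
print(y.shape)  # (6, 8)
```

Because each token touches only `top_k` of the `n_experts` sub-networks, the per-token FFN cost scales with the activated fraction rather than the full hidden width, which is where the claimed FLOPs savings come from.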
Problem

Research questions and friction points this paper is trying to address.

Reduces computational costs in visual autoregressive image generation models
Maintains model performance while dynamically activating weights and tokens
Preserves pretrained dependencies through selective computation and reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic activation of weights and tokens
Learned routing for token-specific expert selection
Two-stage knowledge distillation for policy alignment
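The gated token selection in the second bullet can be sketched in the same spirit: score every token, run the expensive computation only on the top-scoring subset, and pass the remaining tokens through unchanged so sequence length and positions stay aligned. The gate vector, keep ratio, and the identity pass-through used as the "reconstruction" are all illustrative assumptions; the paper's actual selector and reconstruction are learned.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model = 10, 8
keep_ratio = 0.5  # fraction of tokens routed through the heavy computation

W_gate = rng.standard_normal((d_model,)) * 0.1  # hypothetical gate parameters

def heavy_update(x):
    # Stand-in for the expensive transformer-block update on selected tokens.
    return np.tanh(x) + x

def gated_forward(x):
    """Compute only on high-score tokens; keep the rest in place so the
    output sequence stays aligned with the input."""
    scores = x @ W_gate                  # (n_tokens,) gate scores
    k = max(1, int(keep_ratio * len(x)))
    keep = np.argsort(-scores)[:k]       # top-k "high-update-potential" tokens
    out = x.copy()                       # unselected tokens: identity pass-through
    out[keep] = heavy_update(x[keep])
    return out, keep

x = rng.standard_normal((n_tokens, d_model))
y, kept = gated_forward(x)
print(y.shape, len(kept))  # (10, 8) 5
```

The key invariant is that the output has the same length and ordering as the input, so downstream layers and the next-scale prediction step see a complete, position-aligned sequence even though only half the tokens were updated.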
Kaixin Zhang
School of Computer Science and Engineering, Central South University
Ruiqing Yang
University of Electronic Science and Technology of China
Yuan Zhang
School of Computer Science, Peking University
Shan You
SenseTime Research
deep learning · multimodal LLM · edge AI
Tao Huang
Shanghai Jiao Tong University