ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual autoregressive (VAR) models suffer from prohibitive computational overhead due to quadratic sequence-length growth, while static pruning often disrupts pretrained dependencies and degrades generation quality. To address this, we propose ActVAR—a novel dual-sparse dynamic architecture that jointly optimizes expert subnetwork activation and token selection in a content-aware manner via a learnable router and a gated token selector. Furthermore, ActVAR integrates a factorized feed-forward network and a two-stage knowledge distillation scheme to dynamically sparsify both weights and token computations—without modifying the backbone architecture. Evaluated on ImageNet at 256×256 resolution, ActVAR reduces FLOPs by up to 21.2% with negligible performance degradation, significantly outperforming static pruning baselines. It achieves an effective trade-off among generation fidelity, inference efficiency, and global contextual consistency.

📝 Abstract
Visual Autoregressive (VAR) models enable efficient image generation via next-scale prediction but face escalating computational costs as sequence length grows. Existing static pruning methods degrade performance by permanently removing weights or tokens, disrupting pretrained dependencies. To address this, we propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences to enhance efficiency without sacrificing capacity. ActVAR decomposes feedforward networks (FFNs) into lightweight expert sub-networks and employs a learnable router to dynamically select token-specific expert subsets based on content. Simultaneously, a gated token selector identifies high-update-potential tokens for computation while reconstructing unselected tokens to preserve global context and sequence alignment. Training employs a two-stage knowledge distillation strategy, where the original VAR model supervises the learning of routing and gating policies to align with pretrained knowledge. Experiments on the ImageNet $256\times 256$ benchmark demonstrate that ActVAR achieves up to $21.2\%$ FLOPs reduction with minimal performance degradation.
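The factorized FFN with content-aware expert routing described above can be sketched as follows. This is a minimal NumPy illustration of the general idea (a full FFN split into expert sub-networks, with a learned router activating a top-k subset per token); all weights, dimensions, the ReLU expert form, and the top-k mixing are illustrative assumptions, not the paper's actual design or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, n_experts, top_k = 8, 32, 4, 2
d_expert = d_ff // n_experts  # factorized FFN: each expert owns a slice of the hidden width

# Hypothetical weights: the full FFN is split into per-expert sub-networks.
W1 = rng.standard_normal((n_experts, d_model, d_expert)) * 0.1
W2 = rng.standard_normal((n_experts, d_expert, d_model)) * 0.1
W_router = rng.standard_normal((d_model, n_experts)) * 0.1

def routed_ffn(x):
    """x: (n_tokens, d_model). Each token activates only its top-k experts."""
    logits = x @ W_router                          # (n_tokens, n_experts) router scores
    topk = np.argsort(-logits, axis=1)[:, :top_k]  # content-aware expert choice per token
    # Softmax over the selected experts' logits gives the mixing weights.
    sel = np.take_along_axis(logits, topk, axis=1)
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, e in enumerate(topk[t]):
            h = np.maximum(x[t] @ W1[e], 0.0)      # expert sub-network (ReLU FFN slice)
            out[t] += w[t, j] * (h @ W2[e])
    return out

tokens = rng.standard_normal((6, d_model))
y = routed_ffn(tokens)
print(y.shape)  # (6, 8)
```

Because each token touches only `top_k` of the `n_experts` sub-networks, the per-token FFN cost scales with the activated fraction rather than the full hidden width, which is where the claimed FLOPs savings come from.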
Problem

Research questions and friction points this paper is trying to address.

Reduces computational costs in visual autoregressive image generation models
Maintains model performance while dynamically activating weights and tokens
Preserves pretrained dependencies through selective computation and reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic activation of weights and tokens
Learned routing for token-specific expert selection
Two-stage knowledge distillation for policy alignment
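The gated token selection in the second bullet can be sketched in the same spirit: score every token, run the expensive computation only on the top-scoring subset, and pass the remaining tokens through unchanged so sequence length and positions stay aligned. The gate vector, keep ratio, and the identity pass-through used as the "reconstruction" are all illustrative assumptions; the paper's actual selector and reconstruction are learned.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model = 10, 8
keep_ratio = 0.5  # fraction of tokens routed through the heavy computation

W_gate = rng.standard_normal((d_model,)) * 0.1  # hypothetical gate parameters

def heavy_update(x):
    # Stand-in for the expensive transformer-block update on selected tokens.
    return np.tanh(x) + x

def gated_forward(x):
    """Compute only on high-score tokens; keep the rest in place so the
    output sequence stays aligned with the input."""
    scores = x @ W_gate                  # (n_tokens,) gate scores
    k = max(1, int(keep_ratio * len(x)))
    keep = np.argsort(-scores)[:k]       # top-k "high-update-potential" tokens
    out = x.copy()                       # unselected tokens: identity pass-through
    out[keep] = heavy_update(x[keep])
    return out, keep

x = rng.standard_normal((n_tokens, d_model))
y, kept = gated_forward(x)
print(y.shape, len(kept))  # (10, 8) 5
```

The key invariant is that the output has the same length and ordering as the input, so downstream layers and the next-scale prediction step see a complete, position-aligned sequence even though only half the tokens were updated.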
Kaixin Zhang
School of Computer Science and Engineering, Central South University
Ruiqing Yang
University of Electronic Science and Technology of China
Yuan Zhang
School of Computer Science, Peking University
Shan You
SenseTime Research
deep learning · multimodal LLM · edge AI
Tao Huang
Shanghai Jiao Tong University