🤖 AI Summary
Vision Transformer (ViT)-based autoencoders often neglect the global semantic role of the class token and employ static attention mechanisms, resulting in weak generative controllability and low training efficiency.
Method: We propose ViTCAE, which explicitly models the class token as a global latent variable for class-conditional generation, enabling a paradigm in which global semantics guide the reconstruction of local details. Inspired by multi-agent consensus theory, we design an adaptive attention mechanism that combines an attention evolution distance with a consensus-driven, convergence-aware temperature schedule to dynamically freeze attention heads during training.
Contribution/Results: ViTCAE achieves high-fidelity image reconstruction while significantly improving generative controllability and model interpretability. On ImageNet-1K, it reduces computational overhead by approximately 32% compared to baseline ViT autoencoders. Our approach establishes a new paradigm for ViT-based generative modeling: efficient, interpretable, and controllable.
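The consensus-driven head-freezing idea above can be sketched in a few lines. Note that this is a minimal illustrative sketch, not the paper's exact formulation: the distance metric (mean total variation between consecutive per-head attention distributions), the `tol`/`patience` thresholds, and the temperature schedule are all assumptions made for demonstration.

```python
import numpy as np

def attention_evolution_distance(prev_attn, curr_attn):
    """Mean total-variation distance between consecutive attention maps
    of one head (each row is a per-query probability distribution).
    Illustrative stand-in for the paper's attention evolution distance."""
    return 0.5 * np.abs(curr_attn - prev_attn).sum(axis=-1).mean()

class HeadFreezer:
    """Anneal each head's softmax temperature as its attention map
    stabilises, and freeze the head once it stays stable long enough.
    All hyperparameters here are illustrative guesses."""
    def __init__(self, n_heads, tol=1e-3, patience=3, t0=1.0, t_min=0.1):
        self.tol, self.patience = tol, patience
        self.t0, self.t_min = t0, t_min
        self.stable = np.zeros(n_heads, dtype=int)   # consecutive stable steps
        self.frozen = np.zeros(n_heads, dtype=bool)  # pruned-from-training flags
        self.temp = np.full(n_heads, t0)             # per-head temperatures

    def update(self, distances):
        for h, d in enumerate(distances):
            if self.frozen[h]:
                continue  # a frozen head is no longer updated
            # Anneal temperature toward t_min as the evolution distance shrinks.
            self.temp[h] = max(self.t_min, self.t0 * d / (d + self.tol))
            self.stable[h] = self.stable[h] + 1 if d < self.tol else 0
            if self.stable[h] >= self.patience:
                self.frozen[h] = True
        return self.frozen
```

In this sketch, a head whose attention distribution barely moves between training steps is annealed to a low temperature and then excluded from further updates, which is the source of the reported training-cost savings.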
📝 Abstract
Vision Transformer (ViT)-based autoencoders often underutilize the global Class token and employ static attention mechanisms, limiting both generative control and optimization efficiency. This paper introduces ViTCAE, a framework that addresses these issues by re-purposing the Class token into a generative linchpin. In our architecture, the encoder maps the Class token to a global latent variable that dictates the prior distribution for local, patch-level latent variables, establishing a robust dependency where global semantics directly inform the synthesis of local details. Drawing inspiration from opinion dynamics, we treat each attention head as a dynamical system of interacting tokens seeking consensus. This perspective motivates a convergence-aware temperature scheduler that adaptively anneals each head's influence function based on its distributional stability. This process enables a principled head-freezing mechanism, guided by theoretically grounded diagnostics such as an attention evolution distance and a consensus/cluster functional. This technique prunes converged heads during training to significantly improve computational efficiency without sacrificing fidelity. By unifying a generative Class token with an adaptive attention mechanism rooted in multi-agent consensus theory, ViTCAE offers a more efficient and controllable approach to transformer-based generation.
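The hierarchical dependency described in the abstract, where the Class-token latent dictates the prior over patch-level latents, can be sketched as a simple conditional Gaussian prior. The dimensions, the linear parameterisation of the prior's mean and log-variance, and all names below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the paper does not specify these here.
D_GLOBAL, D_LOCAL, N_PATCHES = 8, 4, 16

# Conditional prior p(z_local | z_global) = N(mu(z_g), diag(exp(logvar(z_g)))),
# with mu and logvar as (assumed) linear maps of the global latent.
w_mu = rng.normal(scale=0.1, size=(D_GLOBAL, D_LOCAL))
b_mu = np.zeros(D_LOCAL)
w_lv = rng.normal(scale=0.1, size=(D_GLOBAL, D_LOCAL))
b_lv = np.zeros(D_LOCAL)

def sample_patch_latents(z_global):
    """Draw every patch-level latent from a prior whose parameters are
    functions of the global (Class-token) latent, so global semantics
    condition the synthesis of local detail."""
    mu = z_global @ w_mu + b_mu                    # prior mean, shape (D_LOCAL,)
    std = np.exp(0.5 * (z_global @ w_lv + b_lv))   # prior std, shape (D_LOCAL,)
    eps = rng.normal(size=(N_PATCHES, D_LOCAL))    # reparameterisation noise
    return mu + std * eps                          # shape (N_PATCHES, D_LOCAL)

z_g = rng.normal(size=D_GLOBAL)        # stand-in for the encoder's Class-token latent
z_locals = sample_patch_latents(z_g)   # patch latents handed to the decoder
```

Changing `z_g` shifts the mean and spread of every patch latent at once, which is one way to read the abstract's claim that the Class token acts as a global control knob for generation.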