🤖 AI Summary
Conventional regularization methods (e.g., Dropout) in Vision Transformers (ViTs) rely on fixed, unstructured sparsity patterns and lack adaptability to task-specific requirements and data complexity.
Method: We propose a likelihood-guided Bayesian sparsification framework based on a variational Ising model, enabling task-driven structured parameter pruning and dynamic attention sparsification. It integrates Ising priors into ViT weight distribution learning, jointly optimizing likelihood gradients and variational inference to achieve interpretable structured feature selection and well-calibrated probabilistic outputs.
Contribution/Results: The framework supports adaptive architecture search and uncertainty-aware modeling during training. Evaluated on MNIST and CIFAR benchmarks, it significantly improves generalization under sparse/noisy data, reduces calibration error by 12.6%, and achieves 37% parameter reduction—without sacrificing accuracy.
📝 Abstract
The transformer architecture has demonstrated strong performance in classification tasks involving structured and high-dimensional data. However, its success often hinges on large- scale training data and careful regularization to prevent overfitting. In this paper, we intro- duce a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), which simultaneously enhances model generalization and dynamically prunes redundant parameters. The proposed variational Ising-based regularization approach leverages Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout-based methods, which enforce fixed sparsity patterns, the variational Ising-based regularization method learns task-adaptive regularization, improving both efficiency and interpretability. We evaluate our approach on benchmark vision datasets, including MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100, demonstrating improved generalization under sparse, complex data and allowing for principled uncertainty quantification on both weights and selection parameters. Additionally, we show that the Ising regularizer leads to better-calibrated probability estimates and structured feature selection through uncertainty-aware attention mechanisms. Our results highlight the effectiveness of structured Bayesian sparsification in enhancing transformer-based architectures, offering a principled alternative to standard regularization techniques.