CLUENet: Cluster Attention Makes Neural Networks Have Eyes

📅 2025-12-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision models such as CNNs and Transformers suffer from rigid receptive fields and structural complexity, which limits their ability to model irregular spatial patterns and impairs interpretability. Clustering-based methods offer semantic transparency, but they exhibit limited accuracy, vanishing gradients during training, and low computational efficiency. To address these limitations, we propose CLUENet, a transparent vision network built on cluster attention. Its key components are global soft aggregation with hard assignment, temperature-scaled cosine attention, gated residual connections, inter-block hard and shared feature dispatching, and an improved cluster pooling strategy. This design aims to preserve local modeling and semantic flexibility while keeping gradient propagation efficient and the architecture traceable. Evaluated on CIFAR-100 and Mini-ImageNet, CLUENet outperforms existing clustering methods and mainstream vision models, balancing accuracy, efficiency, and interpretability.

📝 Abstract
Despite the success of convolution- and attention-based models in vision tasks, their rigid receptive fields and complex architectures limit their ability to model irregular spatial patterns and hinder interpretability, which poses challenges for tasks requiring high model transparency. Clustering paradigms offer promising interpretability and flexible semantic modeling, but suffer from limited accuracy, low efficiency, and vanishing gradients during training. To address these issues, we propose the CLUster attEntion Network (CLUENet), a transparent deep architecture for visual semantic understanding. Its three key innovations are (i) Global Soft Aggregation and Hard Assignment with Temperature-Scaled Cosine Attention and gated residual connections for enhanced local modeling, (ii) inter-block Hard and Shared Feature Dispatching, and (iii) an improved cluster pooling strategy. These enhancements significantly improve both classification performance and visual interpretability. Experiments on CIFAR-100 and Mini-ImageNet demonstrate that CLUENet outperforms existing clustering methods and mainstream visual models, offering a compelling balance of accuracy, efficiency, and transparency.
Problem

Research questions and friction points this paper is trying to address.

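Rigid receptive fields and complex architectures limit interpretability and the modeling of irregular spatial patterns in vision tasks
Clustering-based neural networks suffer from limited accuracy, low efficiency, and vanishing gradients during training
Tasks requiring high model transparency need both strong classification performance and visual interpretability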
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global soft aggregation and hard assignment with temperature-scaled cosine attention and gated residual connections
Inter-block hard and shared feature dispatching
Improved cluster pooling strategy
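
To make the first item above more concrete, here is a minimal NumPy sketch of how temperature-scaled cosine similarity can drive a soft aggregation step followed by a hard assignment. It is not the authors' implementation: the function name `cluster_attention`, the `temperature` default, the softmax over clusters, and the weighted-mean aggregation are all illustrative assumptions, and the gated residual connections and cluster pooling are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cluster_attention(features, centers, temperature=0.1):
    """Hypothetical sketch: features is (N, D), centers is (K, D).

    Returns aggregated cluster features (K, D), a hard cluster index
    per point (N,), and the per-point dispatched features (N, D).
    """
    # Cosine similarity between every point and every cluster center, in [-1, 1].
    sim = l2_normalize(features) @ l2_normalize(centers).T  # (N, K)

    # Assumption: dividing by a temperature sharpens (small value) or
    # flattens (large value) the soft weights over clusters.
    weights = softmax(sim / temperature, axis=1)  # (N, K)

    # Soft aggregation: every point contributes to every cluster,
    # weighted by its soft membership (a weighted mean per cluster).
    aggregated = (weights.T @ features) / (weights.sum(axis=0)[:, None] + 1e-8)

    # Hard assignment: each point is tied to its single most similar cluster.
    assignment = sim.argmax(axis=1)  # (N,)

    # Dispatch: each point receives the aggregated feature of its cluster.
    dispatched = aggregated[assignment]  # (N, D)
    return aggregated, assignment, dispatched


# Toy usage: two points near the x-axis, two near the y-axis.
points = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
agg, assign, disp = cluster_attention(points, centers)
print(assign)  # [0 0 1 1]
```

Because each point maps to exactly one cluster index, the `assignment` array can be reshaped to the image grid and displayed as a segmentation-like map, which is one way a hard assignment of this kind can support visual interpretability.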