🤖 AI Summary
Existing Transformer-based image classification models partition images into uniform grid tokens, neglecting regional semantics and thus limiting representational capacity. To address this, the authors propose the Fuzzy Token Clustering Transformer (FTCFormer), which dynamically identifies semantically salient cluster centers by combining Density Peak Clustering with Fuzzy K-Nearest Neighbors (DPC-FKNN). FTCFormer further introduces a Spatial Connectivity Score (SCS) for token assignment and channel-wise merging (Cmerge) for token fusion. The core innovation is integrating unsupervised clustering and fuzzy relational modeling directly into visual token generation, so that tokens follow semantic content rather than grid position. FTCFormer is evaluated on 32 datasets spanning fine-grained, natural, medical, and remote sensing imagery. Compared to the TCFormer baseline, it achieves consistent accuracy improvements of +1.43%, +1.09%, +0.97%, and +0.55% on the fine-grained, natural, medical, and remote sensing datasets, respectively.
📝 Abstract
Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens and neglect the underlying semantic meaning of image regions, which results in suboptimal feature representations. To address this issue, we propose the Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on semantic meaning instead of spatial position. It allocates fewer tokens to less informative regions and more tokens to semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline: gains of 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets, and 0.55% on four remote sensing datasets. The code is available at: https://github.com/BaoBao0926/FTCFormer/tree/main.
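To make the idea of clustering-based downsampling concrete, the following is a minimal sketch of ordinary density peak clustering with a k-nearest-neighbor density estimate, followed by plain mean merging. It is not the DPC-FKNN, SCS, or Cmerge procedure from the paper, whose details the abstract does not give: the fuzzy membership and spatial connectivity terms are omitted, nearest-center assignment and simple averaging stand in for them, and the function names `dpc_knn_centers` and `merge_tokens` are hypothetical.

```python
import math

def dpc_knn_centers(tokens, num_centers, k=5):
    """Select cluster centers among tokens with plain DPC-kNN (illustrative only).

    tokens: list of equal-length feature vectors (at least two tokens).
    Returns (centers, assignment): indices of the chosen center tokens, and for
    each token the position in `centers` of its nearest center.
    """
    n = len(tokens)
    # Pairwise Euclidean distances between token features.
    dist = [[math.dist(a, b) for b in tokens] for a in tokens]

    # Local density: high when the k nearest neighbours (excluding self) are close.
    density = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        density.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))

    # Distance indicator: distance to the closest token with strictly higher
    # density; tokens with no denser neighbour receive the maximum distance.
    max_dist = max(max(row) for row in dist)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if density[j] > density[i]]
        delta.append(min(higher) if higher else max_dist)

    # Tokens that are both dense and far from denser tokens become centers.
    score = [density[i] * delta[i] for i in range(n)]
    centers = sorted(range(n), key=lambda i: -score[i])[:num_centers]

    # Assumption: assign every token to its nearest center in feature space.
    assignment = [
        min(range(len(centers)), key=lambda c: dist[i][centers[c]])
        for i in range(n)
    ]
    return centers, assignment

def merge_tokens(tokens, assignment, num_centers):
    """Merge each cluster into one token by averaging (a stand-in for Cmerge)."""
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_centers)]
    counts = [0] * num_centers
    for tok, c in zip(tokens, assignment):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += tok[d]
    return [[s / max(counts[c], 1) for s in sums[c]] for c in range(num_centers)]
```

Called on an `N x C` list of token features with `num_centers` smaller than `N`, the two functions reduce the token count while letting cluster membership follow feature similarity rather than grid position, which is the general behavior the abstract describes.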