FTCFormer: Fuzzy Token Clustering Transformer for Image Classification

📅 2025-07-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing Transformer-based image classification models partition images into uniform grid tokens, neglecting regional semantics and thus limiting representational capacity. To address this, the authors propose the Fuzzy Token Clustering Transformer (FTCFormer), which dynamically identifies semantically salient cluster centers by combining Density Peak Clustering (DPC) with Fuzzy K-Nearest Neighbors (FKNN). FTCFormer further introduces a Spatial Connectivity Score (SCS) for semantic-driven token assignment and channel-wise merging (Cmerge) for token fusion. The core innovation is integrating unsupervised clustering and fuzzy relational modeling directly into visual token generation, which preserves both semantic consistency and spatial structure. FTCFormer is evaluated on 32 datasets spanning fine-grained, natural, medical, and remote sensing imagery. Compared to the TCFormer baseline, it improves accuracy by 1.43%, 1.09%, 0.97%, and 0.55% on the fine-grained, natural, medical, and remote sensing groups, respectively.

📝 Abstract

Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions and resulting in suboptimal feature representations. To address this issue, we propose Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on semantic meaning instead of spatial position. It allocates fewer tokens to less informative regions and more tokens to semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline: gains of 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets, and 0.55% on four remote sensing datasets. The code is available at: https://github.com/BaoBao0926/FTCFormer/tree/main.
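
To make the idea of clustering-based downsampling concrete, here is a minimal NumPy sketch of generic density-peak token clustering with a k-nearest-neighbor density estimate. It is an illustration only, not the FTCFormer implementation: the fuzzy FKNN weighting, the SCS assignment rule, and Cmerge are not specified in this summary, so the sketch substitutes a plain kNN density, nearest-center assignment, and simple mean merging. The function name `dpc_knn_downsample` and its parameters are hypothetical.

```python
import numpy as np

def dpc_knn_downsample(tokens, num_clusters, k=5):
    """Cluster N tokens of shape (N, C) into num_clusters merged tokens.

    Hypothetical sketch: plain kNN density peaks, not the paper's DPC-FKNN.
    Requires k < N.
    """
    # Pairwise Euclidean distances between token features.
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))

    # Local density: tokens with close neighbours get a density near 1.
    # Column 0 of the sorted distances is the token itself, so skip it.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = np.exp(-(knn ** 2).mean(axis=1))

    # Distance indicator: distance to the nearest token of higher density.
    higher = rho[None, :] > rho[:, None]   # higher[i, j]: rho_j > rho_i
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()    # the densest token has no such peer

    # Centers are tokens that are both dense and far from denser tokens.
    centers = np.argsort(-(rho * delta))[:num_clusters]

    # Assumption: hard assignment to the nearest center (stand-in for SCS).
    assignment = dist[:, centers].argmin(axis=1)
    assignment[centers] = np.arange(num_clusters)

    # Assumption: unweighted mean per cluster (stand-in for Cmerge).
    merged = np.stack([tokens[assignment == c].mean(axis=0)
                       for c in range(num_clusters)])
    return merged, assignment, centers
```

Calling `dpc_knn_downsample(x, num_clusters=49)` on a `(196, C)` array of patch features would return 49 merged tokens whose grouping follows feature similarity rather than grid position, which is the behavior the abstract describes at a high level.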

Problem

Research questions and friction points this paper is trying to address.

Improves semantic-aware token generation in vision transformers

Enhances feature representation via clustering-based downsampling

Boosts image classification accuracy across diverse datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Clustering-based downsampling for dynamic token generation

DPC-FKNN mechanism for clustering center determination

SCS and Cmerge for enhanced token assignment and merging

🔎 Similar Papers

Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens