Representation Learning with Adaptive Superpixel Coding

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional Vision Transformers rely on fixed-grid patch partitioning, which hinders adaptability to diverse image content and limits semantic representation capability. To address this, we propose Adaptive Superpixel Coding (ASC), a self-supervised Transformer whose core innovation is a learnable, dynamic superpixel segmentation layer that replaces static patch embedding and enables content-aware, adaptive region aggregation. This layer is jointly optimized with the Transformer encoder in a fully self-supervised manner, requiring no human annotations, to learn structured visual representations. Evaluated on ImageNet classification as well as downstream detection and segmentation benchmarks, ASC consistently outperforms state-of-the-art methods including ViT and MAE. The results demonstrate stronger generalization and enhanced semantic modeling capacity, validating the effectiveness of adaptive, content-driven spatial abstraction in Vision Transformers.

📝 Abstract
Deep learning vision models are typically tailored for specific modalities and often rely on domain-specific assumptions, such as the grid structures used by nearly all existing vision models. In this work, we propose a self-supervised model based on Transformers, which we call Adaptive Superpixel Coding (ASC). The key insight of our model is to overcome the limitations of traditional Vision Transformers, which depend on fixed-size and non-adaptive patch partitioning. Instead, ASC employs adaptive superpixel layers that dynamically adjust to the underlying image content. We analyze key properties of the approach that make it effective, and find that our method outperforms widely-used alternatives on standard image downstream task benchmarks.
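The abstract does not give implementation details for the adaptive superpixel layer. As a rough illustration of the general idea, content-aware region aggregation in place of fixed-grid patches, here is a minimal NumPy sketch of a soft superpixel pooling step; the function name, the distance-softmax assignment, and the temperature parameter are all assumptions, not the paper's actual method:

```python
import numpy as np

def soft_superpixel_pooling(features, centroids, tau=0.1):
    """Aggregate per-pixel features into K adaptive 'superpixel' tokens.

    features:  (N, D) per-pixel feature vectors
    centroids: (K, D) superpixel centroids (learnable in a real model)
    Returns a (K, D) array: each token is a softly weighted mean of the
    pixels assigned to that centroid, so regions follow image content
    rather than a fixed grid.
    """
    # Negative squared distances act as assignment logits
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -d2 / tau
    # Softmax over superpixels: each pixel distributes mass across K tokens
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                                    # (N, K)
    # Weighted average of pixel features per superpixel token
    tokens = (w.T @ features) / (w.sum(axis=0)[:, None] + 1e-8)          # (K, D)
    return tokens
```

In an end-to-end model the centroids (or an equivalent assignment network) would be trained jointly with the Transformer encoder, and the resulting tokens would replace the usual patch embeddings as encoder input.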
Problem

Research questions and friction points this paper is trying to address.

Overcoming fixed-size patch limitations in vision transformers
Adapting image representation to dynamic content features
Enhancing self-supervised learning for downstream vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive superpixel layers dynamically adjust to image content
Self-supervised Transformer model overcomes fixed patch limitations
Method outperforms alternatives on image benchmark tasks