AI Summary
To address the high cost of expert annotations and the weak discriminability of self-supervised representations in fine-grained visual classification (FGVC), this paper proposes a Cross-level Multi-instance Distillation (CMD) framework. CMD introduces, for the first time, a collaborative intra-level and inter-level multi-instance knowledge distillation mechanism that explicitly models the contribution of discriminative image patches to fine-grained semantics, thereby overcoming the misalignment between class-agnostic pre-trained features and fine-grained discriminative signals. The framework integrates multi-instance learning, region-to-image crop alignment, self-supervised contrastive learning, and cross-level feature relationship modeling. On CUB-200-2011, Stanford Cars, and FGVC Aircraft, CMD achieves top-1 classification accuracy and Rank-1 retrieval rates surpassing state-of-the-art self-supervised methods by up to 19.78%, significantly enhancing the quality of fine-grained representations.
Abstract
High-quality annotation of fine-grained visual categories demands extensive expert knowledge, which is taxing and time-consuming. Alternatively, learning fine-grained visual representations from enormous numbers of unlabeled images (e.g., of species or brands) via self-supervised learning becomes a feasible solution. However, recent investigations find that existing self-supervised learning methods are ill-suited to representing fine-grained categories. The bottleneck is that the pre-trained class-agnostic representation is built from every patch-wise embedding, whereas fine-grained categories are determined by only a few key patches of an image. In this paper, we propose a Cross-level Multi-instance Distillation (CMD) framework to tackle this challenge. Our key idea is to model the importance of each image patch in determining the fine-grained representation via multiple-instance learning. To comprehensively learn the relation between informative patches and fine-grained semantics, multi-instance knowledge distillation is applied both to the region/image crop pairs across the teacher and student networks, and to the region-image crops inside the teacher/student network, which we term intra-level multi-instance distillation and inter-level multi-instance distillation, respectively. Extensive experiments on several commonly used datasets, including CUB-200-2011, Stanford Cars, and FGVC Aircraft, demonstrate that the proposed method outperforms contemporary methods by up to 10.14% and existing state-of-the-art self-supervised learning approaches by up to 19.78% in both top-1 accuracy and Rank-1 retrieval. Source code is available at https://github.com/BiQiWHU/CMD
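The core idea, weighting each patch embedding by its importance before distilling the pooled representation from teacher to student, can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the attention-style softmax pooling, the scoring vector `w`, and the cosine distillation loss are all illustrative assumptions standing in for the learned components described in the abstract.

```python
import numpy as np

def mil_pool(patch_embs, w):
    """Multiple-instance pooling: score each patch embedding, softmax the
    scores over the bag, and return the importance-weighted bag embedding.
    patch_embs: (N, D) patch embeddings of one image; w: (D,) scoring vector
    (a stand-in for the learned patch-importance module)."""
    scores = patch_embs @ w                    # (N,) raw per-patch scores
    alpha = np.exp(scores - scores.max())      # numerically stable softmax
    alpha /= alpha.sum()                       # patch importance weights
    return alpha @ patch_embs, alpha           # (D,) bag embedding, (N,) weights

def distill_loss(student_bag, teacher_bag):
    """Cosine-similarity distillation between the student's and teacher's
    MIL-pooled representations (an assumed loss form for illustration)."""
    s = student_bag / np.linalg.norm(student_bag)
    t = teacher_bag / np.linalg.norm(teacher_bag)
    return 1.0 - float(s @ t)                  # 0 when aligned, 2 when opposed
```

In the full framework this pairing is applied both across the teacher and student networks (intra-level) and between region crops and the whole image within one network (inter-level); the sketch above shows only the shared pooling-then-distillation step.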