🤖 AI Summary
In zero-shot learning, manually defined class-level semantic prototypes suffer from instance-level misalignment (e.g., occlusion, viewpoint variation) and class-level semantic imprecision, leading to a biased visual–semantic mapping and degraded knowledge transfer to unseen classes. To address this, we propose a prototype-guided curriculum learning framework: (1) training samples are introduced progressively, ordered by the cosine similarity between their mapped visual features and the class-level semantic prototypes, so well-aligned samples are learned first; (2) an instance-level feedback mechanism dynamically refines the class-level prototypes, mitigating the noisy supervision they introduce. Our method requires no additional supervision and improves generalization to unseen classes. Extensive experiments on the AWA2, SUN, and CUB benchmarks verify both the effectiveness and the robustness of the approach.
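To make point (1) concrete, here is a minimal sketch of prototype-guided curriculum selection, assuming visual features have already been mapped into the semantic space; the pacing fraction `keep_frac`, the function name, and the tensor shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of prototype-guided curriculum sample selection.
# All names and the pacing schedule are illustrative assumptions.
import torch
import torch.nn.functional as F

def select_curriculum_batch(visual_maps, prototypes, labels, keep_frac):
    """Keep the fraction of samples best aligned with their class prototypes.

    visual_maps: (N, D) visual features mapped into the semantic space
    prototypes:  (C, D) class-level semantic prototypes
    labels:      (N,)   class index per sample
    keep_frac:   float in (0, 1], grown toward 1 as training proceeds
    """
    # Cosine similarity between each sample and its own class prototype.
    sims = F.cosine_similarity(visual_maps, prototypes[labels], dim=1)  # (N,)
    k = max(1, int(keep_frac * sims.numel()))
    # Best-aligned ("easiest") samples are selected first.
    _, idx = torch.topk(sims, k)
    return idx

# Toy usage: 8 samples, 3 classes, 5-dim semantic space.
torch.manual_seed(0)
v = torch.randn(8, 5)
p = torch.randn(3, 5)
y = torch.randint(0, 3, (8,))
print(select_curriculum_batch(v, p, y, keep_frac=0.5))
```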
📝 Abstract
In Zero-Shot Learning (ZSL), embedding-based methods enable knowledge transfer from seen to unseen classes by learning a visual-semantic mapping from seen-class images to class-level semantic prototypes (e.g., attributes). However, these semantic prototypes are manually defined and may introduce noisy supervision for two main reasons: (i) instance-level mismatch: variations in viewpoint, occlusion, and annotation bias cause discrepancies between individual samples and the class-level semantic prototypes; and (ii) class-level imprecision: the manually defined semantic prototypes may not accurately reflect the true semantics of the class. Consequently, the visual-semantic mapping is misled, reducing the effectiveness of knowledge transfer to unseen classes. In this work, we propose a prototype-guided curriculum learning framework (dubbed CLZSL), which mitigates instance-level mismatches through a Prototype-Guided Curriculum Learning (PCL) module and addresses class-level imprecision via a Prototype Update (PUP) module. Specifically, the PCL module first prioritizes samples whose visual mappings have high cosine similarity to the class-level semantic prototypes and progressively advances to less-aligned samples, thereby reducing the interference of instance-level mismatches and yielding an accurate visual-semantic mapping. In addition, the PUP module dynamically updates the class-level semantic prototypes using the visual mappings learned from instances, thereby reducing class-level imprecision and further improving the visual-semantic mapping. Experiments on the standard benchmark datasets AWA2, SUN, and CUB verify the effectiveness of our method.
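To illustrate the PUP idea, below is a minimal sketch that refines each class prototype with an exponential moving average of the mean visual mapping of that class's samples; the EMA form, the `momentum` value, and the optional re-normalization are assumptions rather than the paper's actual update rule.

```python
# Minimal sketch of instance-driven prototype refinement (PUP idea).
# The EMA update and momentum value are assumptions, not the paper's rule.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(prototypes, visual_maps, labels, momentum=0.9):
    """prototypes: (C, D); visual_maps: (N, D); labels: (N,)."""
    for c in labels.unique():
        # Mean mapped visual feature of this class's samples in the batch.
        class_mean = visual_maps[labels == c].mean(dim=0)
        # Blend the old prototype with the instance-level evidence.
        blended = momentum * prototypes[c] + (1 - momentum) * class_mean
        # Re-normalize so cosine similarity stays well behaved (an assumption).
        prototypes[c] = F.normalize(blended, dim=0)
    return prototypes
```

In this sketch the two modules would interact per epoch: PCL selects the currently well-aligned samples, the mapping is trained on them, and PUP then feeds the resulting instance mappings back into the prototypes before the pacing fraction grows.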