🤖 AI Summary
To address insufficient semantic understanding and high annotation costs in large-scale multi-label image classification (classes 1–19, excluding class 12), this paper proposes a lightweight multimodal multitask model. Methodologically, it introduces an interpretable vision–text fusion module that jointly models CNN-extracted visual features and text descriptions generated by an NLP model; additionally, a semi-supervised transfer learning strategy leverages limited labeled data to guide learning from unlabeled samples, thereby mitigating purely data-driven bias. Ablation studies demonstrate that the proposed model significantly outperforms baseline methods in both accuracy and generalization. The overall architecture is computationally efficient and compact, enabling end-to-end automatic image annotation. Its low inference overhead and robust performance make it suitable for practical deployment in real-world multi-label classification scenarios.
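The semi-supervised strategy above can be illustrated with a minimal pseudo-labeling sketch: a classifier fit on the small labeled pool assigns labels to unlabeled samples only when its confidence clears a threshold. The nearest-centroid classifier, toy data, and `THRESHOLD` value here are illustrative assumptions, not the paper's actual components.

```python
import numpy as np

# Hypothetical pseudo-labeling loop (assumed confidence-threshold scheme,
# not the paper's exact method).
rng = np.random.default_rng(1)

def nearest_centroid_proba(x, centroids):
    """Softmax over negative distances to per-class centroids."""
    d = np.linalg.norm(centroids - x, axis=1)
    e = np.exp(-d)
    return e / e.sum()

# Toy data: two classes, a few labeled points, many unlabeled ones.
labeled_x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labeled_y = np.array([0, 0, 1, 1])
unlabeled_x = rng.standard_normal((20, 2)) * 0.3 + np.array([5.0, 5.0])

# Fit: one centroid per class from the labeled pool.
centroids = np.stack([labeled_x[labeled_y == c].mean(axis=0) for c in (0, 1)])

# Keep only confident predictions to limit label noise from the unlabeled pool.
THRESHOLD = 0.9
pseudo = []
for x in unlabeled_x:
    p = nearest_centroid_proba(x, centroids)
    if p.max() >= THRESHOLD:
        pseudo.append((x, int(p.argmax())))

print(len(pseudo), "confident pseudo-labels added")
```

In a full pipeline, the pseudo-labeled pairs would be merged into the training set for the next round, which is how limited labels guide learning from unlabeled samples.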
📝 Abstract
As the volume of digital image data increases, the demand for effective image classification intensifies. This study introduces a robust multi-label classification system designed to assign multiple labels to a single image, addressing the complexity of images that may be associated with multiple categories (ranging from 1 to 19, excluding 12). We propose a multi-modal classifier that merges advanced image recognition algorithms with Natural Language Processing (NLP) models, incorporating a fusion module to integrate these distinct modalities. Integrating textual data enhances the accuracy of label prediction by providing contextual understanding that visual analysis alone cannot fully capture. Our proposed classification model combines Convolutional Neural Networks (CNN) for image processing with NLP techniques for analyzing textual descriptions (i.e., captions). This approach includes rigorous training and validation phases, with each model component verified and analyzed through ablation experiments. Preliminary results demonstrate the classifier's accuracy and efficiency, highlighting its potential as an automatic image-labeling system.
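The fusion module described above can be sketched as a simple late-fusion head: CNN image features and caption embeddings are concatenated, passed through a linear layer, and each of the 18 valid classes (1–19, excluding 12) receives an independent sigmoid score so that multiple labels can fire at once. The feature dimensions, weight shapes, and 0.5 threshold below are illustrative assumptions, not the paper's reported configuration.

```python
import numpy as np

# Assumed late-fusion multi-label head; all dimensions are placeholders.
rng = np.random.default_rng(0)

CLASSES = [c for c in range(1, 20) if c != 12]  # 18 labels: 1-19, excluding 12

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_and_predict(img_feat, txt_feat, W, b, threshold=0.5):
    """Concatenate visual and textual features, apply a linear head,
    and keep every class whose sigmoid score clears the threshold."""
    fused = np.concatenate([img_feat, txt_feat])  # simple late fusion
    probs = sigmoid(W @ fused + b)                # one probability per class
    return [c for c, p in zip(CLASSES, probs) if p >= threshold]

img_feat = rng.standard_normal(512)   # stand-in for CNN image features
txt_feat = rng.standard_normal(128)   # stand-in for NLP caption embedding
W = rng.standard_normal((len(CLASSES), 512 + 128)) * 0.01
b = np.zeros(len(CLASSES))

labels = fuse_and_predict(img_feat, txt_feat, W, b)
print(labels)  # a subset of CLASSES; several labels may be predicted together
```

Because each class is scored independently (sigmoid rather than softmax), the head naturally supports images belonging to multiple categories at once, which is the core requirement of the multi-label setting.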