Google is all you need: Semi-Supervised Transfer Learning Strategy For Light Multimodal Multi-Task Classification Model

📅 2025-01-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address insufficient semantic understanding and high annotation costs in large-scale multi-label image classification (labels numbered 1 to 19, excluding 12), this paper proposes a lightweight multimodal multitask model. Methodologically, it introduces an interpretable vision–text fusion module that jointly models CNN-extracted visual features and text descriptions processed by an NLP model; additionally, a semi-supervised transfer learning strategy uses limited labeled data to guide learning from unlabeled samples, thereby mitigating purely data-driven bias. Ablation studies indicate that the proposed model outperforms baseline methods in both accuracy and generalization. The overall architecture is computationally efficient and compact, enabling end-to-end automatic image annotation. Its low inference overhead and robust performance make it suitable for practical deployment in real-world multi-label classification scenarios.
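The summary does not say how the limited labeled data guides learning from unlabeled samples. One common way to do this is confidence-based pseudo-labeling, sketched below purely as an illustration; it is not claimed to be the authors' strategy. The function name `select_pseudo_labels` and the thresholds `hi` and `lo` are hypothetical, and the per-label probabilities are assumed to come from a model already trained on the labeled subset.

```python
from typing import Dict, List, Sequence, Tuple

# Label space as described in the summary: 1..19 without 12 (18 labels).
LABELS: List[int] = [c for c in range(1, 20) if c != 12]


def select_pseudo_labels(
    probs: Sequence[Dict[int, float]],
    hi: float = 0.9,  # assumed threshold for a confident positive
    lo: float = 0.1,  # assumed threshold for a confident negative
) -> List[Tuple[int, List[int]]]:
    """Return (sample_index, pseudo_labels) for confidently predicted samples.

    A sample is kept only if every label probability is either >= hi or
    <= lo, and at least one label is a confident positive.
    """
    selected = []
    for idx, p in enumerate(probs):
        confident = all(p[c] >= hi or p[c] <= lo for c in LABELS)
        positives = [c for c in LABELS if p[c] >= hi]
        if confident and positives:
            selected.append((idx, positives))
    return selected


def make_probs(positive_labels):
    """Toy stand-in for model output on one unlabeled image."""
    return {c: (0.95 if c in positive_labels else 0.02) for c in LABELS}


unlabeled = [
    make_probs({1, 3}),            # confident on every label: kept
    {**make_probs({5}), 7: 0.5},   # label 7 is uncertain: skipped
    make_probs(set()),             # no positive label: skipped
]
print(select_pseudo_labels(unlabeled))  # [(0, [1, 3])]
```

Samples selected this way would be added to the training set with their pseudo-labels, while uncertain samples stay unlabeled for a later round.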

📝 Abstract
As the volume of digital image data increases, the need for effective image classification intensifies. This study introduces a robust multi-label classification system designed to assign multiple labels to a single image, addressing the complexity of images that may belong to several categories at once (labels range from 1 to 19, excluding 12). We propose a multi-modal classifier that merges advanced image recognition algorithms with Natural Language Processing (NLP) models, incorporating a fusion module to integrate these distinct modalities. The purpose of integrating textual data is to enhance the accuracy of label prediction by providing contextual understanding that visual analysis alone cannot fully capture. Our proposed classification model combines Convolutional Neural Networks (CNNs) for image processing with NLP techniques for analyzing textual descriptions (i.e., captions). The approach includes rigorous training and validation phases, with each model component verified and analyzed through ablation experiments. Preliminary results demonstrate the classifier's accuracy and efficiency, highlighting its potential as an automatic image-labeling system.
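To make the label space and the fusion idea concrete, the sketch below encodes label sets as 18-dimensional multi-hot vectors (labels 1 to 19 without 12) and scores an image by concatenating an image feature vector with a caption feature vector before a linear layer with per-label sigmoids. This is a minimal illustration in plain Python under stated assumptions: the abstract does not specify the fusion module's design, so concatenation is an assumed stand-in, and the names `encode`, `decode`, and `fuse_and_score`, the toy features, and the weights are all hypothetical.

```python
import math
from typing import Iterable, List, Sequence

# Labels 1..19 excluding 12, as stated in the abstract (18 labels in total).
LABELS: List[int] = [c for c in range(1, 20) if c != 12]
INDEX = {c: i for i, c in enumerate(LABELS)}


def encode(labels: Iterable[int]) -> List[int]:
    """Multi-hot target vector; raises KeyError for 12 or out-of-range labels."""
    vec = [0] * len(LABELS)
    for c in labels:
        vec[INDEX[c]] = 1
    return vec


def decode(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Labels whose predicted probability reaches the (assumed) threshold."""
    return [c for c, p in zip(LABELS, probs) if p >= threshold]


def fuse_and_score(
    img_feat: Sequence[float],
    txt_feat: Sequence[float],
    weights: Sequence[Sequence[float]],  # one row per label
    bias: Sequence[float],
) -> List[float]:
    """Concatenate both modalities, apply a linear layer, then a sigmoid per label."""
    fused = list(img_feat) + list(txt_feat)
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Toy example: 2 image features + 2 caption features, hand-set weights.
weights = [[0.0] * 4 for _ in LABELS]
weights[INDEX[1]] = [4.0, 0.0, 0.0, 0.0]   # label 1 driven by the image
weights[INDEX[13]] = [0.0, 0.0, 0.0, 4.0]  # label 13 driven by the caption
bias = [-2.0] * len(LABELS)

probs = fuse_and_score([1.0, 0.0], [0.0, 1.0], weights, bias)
print(decode(probs))  # [1, 13]
```

Each label gets its own sigmoid rather than sharing a softmax, since an image may carry several labels at once; in the toy example, label 13 is recovered only because the caption features contribute to the fused vector.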
Problem

Research questions and friction points this paper is trying to address.

Lightweight Model
Multi-label Classification
Large-scale Image Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convolutional Neural Network (CNN)
Multi-label Classification
Text Understanding Integration
Haixu Liu
The University of Sydney
Deep Learning, Computer Vision, LLM
Penghao Jiang
The University of Sydney
Zerui Tao
The University of Sydney