🤖 AI Summary
This work addresses two challenges in multi-label remote sensing image classification: modeling multi-path hierarchical label structures and the underutilization of unlabeled data. To this end, the authors propose an end-to-end, scalable framework that integrates hierarchy-specific class tokens into a Vision Transformer and explicitly captures multi-path dependencies among labels via a graph convolutional network. A self-supervised learning branch is also embedded to leverage unlabeled data effectively. This is the first attempt to jointly incorporate hierarchical label structure and semi-supervised learning in remote sensing multi-label classification. The method achieves state-of-the-art performance across four benchmark datasets (UCM, AID, DFC-15, and MLRSNet) and shows particularly large improvements over existing methods in label-scarce scenarios.
📝 Abstract
Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (*Hierarchical and Explicit Label Modeling*), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.
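The GCN component (item ii) can be sketched in miniature. The label names, hierarchy edges, feature values, and weight matrix below are illustrative assumptions, not HELM's actual hierarchy or parameters; the sketch only shows one symmetric-normalized graph-convolution step, through which a multi-path node such as "beach" (two parents) aggregates features from both of its branches.

```python
# Toy multi-path label hierarchy (hypothetical labels, not from the paper):
# "beach" has two parents ("water" and "land"), so it lies on two paths.
labels = ["scene", "water", "land", "beach"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]  # parent -> child pairs

n = len(labels)
# Adjacency with self-loops (A + I), symmetrized for undirected propagation.
A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for p, c in edges:
    A[p][c] = A[c][p] = 1.0

# Symmetric normalization: A_hat = D^{-1/2} (A + I) D^{-1/2}
deg = [sum(row) for row in A]
A_hat = [[A[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
         for i in range(n)]

def matmul(X, Y):
    """Plain-Python matrix multiply, to keep the sketch dependency-free."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Toy 2-d label embeddings H and an identity weight W for readability;
# one layer computes H' = ReLU(A_hat @ H @ W).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
W = [[1.0, 0.0], [0.0, 1.0]]
H_out = [[max(0.0, v) for v in row] for row in matmul(matmul(A_hat, H), W)]

# H_out[3] ("beach") now mixes features from both parents, which is how
# multi-path dependencies enter its hierarchy-aware embedding.
```

In HELM these hierarchy-aware embeddings interact with the hierarchy-specific class tokens of the Vision Transformer; a real implementation would stack such layers with learned weights rather than the identity used here.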