🤖 AI Summary
To address the challenges of modeling complex spatial layouts and multi-scale features in mining area scene classification, this paper constructs a mining-area-specific multimodal land-cover scene classification dataset and proposes a dual-stream collaborative representation model. The model comprises (i) a multi-scale global Transformer branch that captures long-range semantic dependencies, and (ii) a local-enhanced attention branch that focuses on fine-grained spatial variations. The branches are integrated via a deep feature fusion module and jointly optimized with a multi-loss objective. Key innovations include decoupling global features into key semantic vectors and a context-aware local attention weighting mechanism. Experiments show that the proposed method achieves 83.63% overall accuracy, outperforming the compared approaches on all evaluation metrics and improving fine-grained scene recognition in mining areas.
📝 Abstract
Scene classification of mining areas provides accurate foundational data for geological environment monitoring and resource development planning. This study fuses multi-source data to construct a multimodal mining-area land-cover scene classification dataset. A significant challenge in mining-area classification lies in the complex spatial layout and multi-scale characteristics of the scenes. Extracting both global and local features makes it possible to comprehensively represent the spatial distribution, thereby enabling a more accurate capture of the holistic characteristics of mining scenes. We propose a dual-branch fusion model that uses collaborative representation to decompose global features into a set of key semantic vectors. The model comprises three key components:

(1) Multi-scale Global Transformer Branch: leverages adjacent large-scale features to generate global channel attention for small-scale features, effectively capturing multi-scale feature relationships.

(2) Local Enhancement Collaborative Representation Branch: refines the attention weights using local features and the reconstructed key semantic set, ensuring that the local context and detailed characteristics of the mining area are effectively integrated. This enhances the model's sensitivity to fine-grained spatial variations.

(3) Dual-Branch Deep Feature Fusion Module: fuses the complementary features of the two branches to incorporate richer scene information, strengthening the model's ability to distinguish and classify complex mining landscapes.

Finally, this study employs multi-loss computation to ensure a balanced integration of the modules. The model achieves an overall accuracy of 83.63%, outperforming the comparative models and attaining the best performance on all other evaluation metrics.
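To make component (1) concrete, here is a minimal NumPy sketch of how an adjacent large-scale feature map could gate the channels of a small-scale map. The gating choice (global average pooling followed by a sigmoid) is an assumption for illustration; the abstract does not specify the exact attention computation.

```python
import numpy as np

def global_channel_attention(large_feat, small_feat):
    """Gate the channels of a small-scale feature map using global
    context pooled from the adjacent larger-scale map.

    large_feat, small_feat: arrays of shape (C, H, W) with the same
    channel count C (spatial sizes may differ across scales).
    Pooling + sigmoid gating is an illustrative assumption.
    """
    pooled = large_feat.mean(axis=(1, 2))        # (C,) global channel context
    weights = 1.0 / (1.0 + np.exp(-pooled))      # sigmoid channel weights in (0, 1)
    return small_feat * weights[:, None, None]   # reweight small-scale channels

# Toy usage: 64 channels, large scale 32x32, small scale 16x16
rng = np.random.default_rng(0)
large = rng.standard_normal((64, 32, 32))
small = rng.standard_normal((64, 16, 16))
out = global_channel_attention(large, small)
print(out.shape)  # (64, 16, 16)
```

The key idea is that the channel weights come from the coarser scale, so the finer scale is modulated by broader scene context rather than by its own statistics.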
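Component (2) can be sketched as scaled dot-product attention between local patch features and the key semantic set decoupled from the global feature. The residual combination and the attention form are assumptions; only the idea of refining local features against the reconstructed semantic set comes from the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_collaborative_attention(local_feats, semantic_vectors):
    """Refine local features with attention over the key semantic set.

    local_feats:      (N, D) local patch features.
    semantic_vectors: (K, D) key semantic vectors decoupled from the
                      global feature.
    Scaled dot-product attention with a residual add is an assumed form.
    """
    d = local_feats.shape[1]
    scores = local_feats @ semantic_vectors.T / np.sqrt(d)  # (N, K)
    attn = softmax(scores, axis=1)                          # weights over semantics
    refined = attn @ semantic_vectors                       # (N, D) reconstruction
    return local_feats + refined                            # residual enhancement

# Toy usage: 49 local patches (7x7 grid), D = 128, K = 8 semantic vectors
rng = np.random.default_rng(1)
patches = rng.standard_normal((49, 128))
semantics = rng.standard_normal((8, 128))
refined = local_collaborative_attention(patches, semantics)
print(refined.shape)  # (49, 128)
```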
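The multi-loss balancing mentioned at the end of the abstract can be sketched as a weighted sum of per-head classification losses (one per branch plus the fused head). The loss weights and the number of heads are illustrative assumptions; the abstract does not publish these values.

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for one sample's logits."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def dual_branch_total_loss(global_logits, local_logits, fused_logits,
                           label, weights=(0.3, 0.3, 0.4)):
    """Weighted multi-loss over the two branches and the fused head.

    The weight values here are placeholders chosen for illustration,
    not the paper's actual settings.
    """
    heads = (global_logits, local_logits, fused_logits)
    return sum(w * cross_entropy(h, label) for w, h in zip(weights, heads))

# Toy usage: 10-class logits from each head, ground-truth class 3
rng = np.random.default_rng(2)
g, l, f = (rng.standard_normal(10) for _ in range(3))
total = dual_branch_total_loss(g, l, f, label=3)
print(total > 0)  # True: cross-entropy is non-negative for each head
```

Supervising each branch separately while also supervising the fused output is a common way to keep one branch from dominating during joint optimization, which matches the abstract's stated goal of balanced module integration.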