🤖 AI Summary
To address performance degradation in 2D image classification caused by complex backgrounds, occlusion, and non-target interference, this paper proposes SIM-Net, a multimodal classification network that combines 2D texture features from RGB images with geometric features from 3D point clouds. Its core innovation is a pixel-to-point transformation module that maps 2D object masks onto structured 3D point clouds without supervision, aligning geometric priors with appearance features across modalities. Dual encoders, a CNN for texture and a PointNet for geometry, then extract complementary representations that are fused in a shared latent space. Evaluated on a plant specimen (herbarium) dataset, SIM-Net achieves gains of up to 9.9% in accuracy and 12.3% in F-score over ResNet-101, and outperforms state-of-the-art Transformer-based methods. These results demonstrate that incorporating interpretable 3D structural reasoning improves robustness and discriminability in fine-grained classification.
📝 Abstract
We introduce the Shape-Image Multimodal Network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution is a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well suited to classifying digitized herbarium specimens, a task made challenging by heterogeneous backgrounds, non-plant elements, and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are fused into a unified latent space. Experimental evaluations on herbarium datasets demonstrate that SIM-Net consistently outperforms ResNet-101, achieving gains of up to 9.9% in accuracy and 12.3% in F-score. It also surpasses several Transformer-based state-of-the-art architectures, highlighting the benefits of incorporating 3D structural reasoning into 2D image classification.
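The pipeline described above (mask → point cloud → PointNet-style encoding → fusion with an image feature) can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's implementation: the pseudo-depth assigned to each mask pixel, the layer sizes, and the random (untrained) weights are all assumptions made only to show the structure of the pixel-to-point lifting and the permutation-invariant geometric encoder.

```python
import numpy as np

def mask_to_point_cloud(mask, n_points=256, seed=0):
    """Lift a binary 2D object mask to an (n_points, 3) point cloud.

    Placeholder sketch: foreground pixels become normalized (x, y)
    coordinates, and a pseudo-depth z is set from each pixel's distance
    to the mask centroid. SIM-Net's actual unsupervised pixel-to-point
    mapping is not specified here; this z is purely an assumption.
    """
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    x = xs / max(w - 1, 1)
    y = ys / max(h - 1, 1)
    z = np.sqrt((x - x.mean()) ** 2 + (y - y.mean()) ** 2)  # placeholder depth
    pts = np.stack([x, y, z], axis=1)
    rng = np.random.default_rng(seed)
    # Resample to a fixed point count (with replacement if the mask is small).
    idx = rng.choice(len(pts), size=n_points, replace=len(pts) < n_points)
    return pts[idx]

def pointnet_global_feature(points, dim=128, seed=0):
    """PointNet-style encoder: shared per-point MLP, then max pooling.

    Weights are random and untrained; this only demonstrates the
    permutation-invariant structure, not a learned model.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((3, 64)) * 0.1
    w2 = rng.standard_normal((64, dim)) * 0.1
    hidden = np.maximum(points @ w1, 0.0)  # shared MLP applied to every point
    hidden = np.maximum(hidden @ w2, 0.0)
    return hidden.max(axis=0)              # symmetric pool: order-invariant

def fuse(image_feat, geom_feat):
    """Late fusion of texture and geometry into one latent vector."""
    return np.concatenate([image_feat, geom_feat])

# Usage: a toy 32x32 mask with a filled square as the "object".
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
cloud = mask_to_point_cloud(mask)       # (256, 3)
geom = pointnet_global_feature(cloud)   # (128,)
fused = fuse(np.ones(64), geom)         # stand-in CNN feature of size 64
```

The max-pool over the point axis is what makes the geometric feature invariant to point ordering, which is why a PointNet-style encoder (rather than a second CNN) is the natural choice for the lifted point cloud.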