🤖 AI Summary
This work addresses the underutilization of multimodal local features in RGB-D indoor scene recognition by proposing a dynamic graph neural network approach. The method employs an adaptive node selection mechanism to extract salient local features from both RGB and depth modalities, constructs a hierarchical graph structure based on spatial relationships among objects, and leverages an attention mechanism to dynamically update graph connections for effective cross-modal feature fusion. Experimental results on the SUN RGB-D and NYU Depth v2 datasets demonstrate that the proposed approach significantly outperforms existing state-of-the-art methods, validating its capability to adaptively discover and integrate discriminative local features from dual modalities.
📝 Abstract
The multi-modality of color and depth, i.e., RGB-D, plays an important role in recent research on indoor scene recognition. In this data representation, the depth map describes the 3D structure of a scene and the geometric relations among its objects. Previous works have shown that local features from both modalities are vital for improving recognition accuracy. However, the adaptive selection and effective exploitation of these key local features remain open problems in this field. In this paper, a dynamic graph model with an adaptive node selection mechanism is proposed to address this problem. In this model, a dynamic graph is built to capture the relations among objects and the scene, and an adaptive node selection method extracts key local features from both the RGB and depth modalities for graph modeling. These nodes are then grouped into three levels, representing near and far relations among objects. Moreover, the graph is updated dynamically according to attention weights. Finally, the updated and optimized features of the RGB and depth modalities are fused for indoor scene recognition. Experiments are performed on the public SUN RGB-D and NYU Depth v2 datasets. Extensive results demonstrate that our method outperforms state-of-the-art methods and show that it is able to exploit crucial local features from both the RGB and depth modalities.
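The pipeline sketched in the abstract — select salient local features as nodes, re-weight graph edges by attention, then fuse the two modalities — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the norm-based saliency score, the feature dimensions, and the mean-pooled fusion are illustrative stand-ins for the learned node selection, hierarchical grouping, and fusion described above.

```python
import numpy as np

def select_nodes(features, k):
    # Score each local feature by its L2 norm (a stand-in for the
    # paper's learned saliency score) and keep the top-k as graph nodes.
    scores = np.linalg.norm(features, axis=1)
    idx = np.argsort(scores)[::-1][:k]
    return features[idx]

def attention_update(nodes):
    # Edge weights from scaled dot-product attention among nodes.
    d = nodes.shape[1]
    logits = nodes @ nodes.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Dynamically re-weighted graph: aggregate neighbor features.
    return weights @ nodes

rng = np.random.default_rng(0)
rgb = rng.standard_normal((20, 8))    # 20 RGB local features, dim 8
depth = rng.standard_normal((20, 8))  # 20 depth local features, dim 8

rgb_nodes = attention_update(select_nodes(rgb, 5))
depth_nodes = attention_update(select_nodes(depth, 5))

# Fuse the updated RGB and depth node features into one scene descriptor.
fused = np.concatenate([rgb_nodes.mean(axis=0), depth_nodes.mean(axis=0)])
print(fused.shape)  # (16,)
```

In the actual model, the selection scores, attention weights, and fusion are learned end-to-end, and nodes are additionally grouped into the three spatial levels described above before the graph update.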