🤖 AI Summary
This work addresses the challenging problem of detecting glass surfaces in videos by proposing an end-to-end spatiotemporal detection method based on motion inconsistency. Leveraging the disparity in motion dynamics between glass and background regions, the study introduces motion inconsistency as a key cue for the first time and designs a novel network, MVGD-Net, comprising optical flow estimation, a Cross-scale Multimodal Fusion Module (CMFM), a History Guided Attention Module (HGAM), a Temporal Cross Attention Module (TCAM), and a Temporal-Spatial Decoder (TSD) to effectively enhance spatiotemporal feature representation. Evaluated on a newly constructed large-scale video glass dataset containing 312 scenes and 19,268 frames, the proposed method significantly outperforms existing approaches, demonstrating its effectiveness and robustness.
📝 Abstract
Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems such as robot and drone navigation. To address this challenge, recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther away than the glass surface itself. Consequently, in video motion scenarios, salient reflected (or transmitted) objects visible on the glass surface move more slowly than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: a Cross-scale Multimodal Fusion Module (CMFM) that integrates extracted spatial features with estimated optical flow maps, and a History Guided Attention Module (HGAM) and a Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, to train our network, we construct a large-scale dataset comprising 312 diverse glass scenes with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
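The motion-inconsistency observation can be illustrated with a toy parallax calculation (an illustration of the cue only, not MVGD-Net itself; the focal length, camera speed, and depths below are assumed values): for a camera translating laterally, a point at depth Z produces image motion of roughly f·v/Z, so a reflected or transmitted object, whose virtual depth lies behind the glass pane, yields a smaller optical-flow magnitude than a real object at the glass plane.

```python
# Toy sketch (not the paper's method): why reflected/transmitted objects
# seen on glass move more slowly in the image than objects at the glass plane.
# For a camera translating sideways at speed v, a point at depth Z
# produces image motion of roughly f * v / Z pixels per frame.

def flow_magnitude(f_px, cam_speed, depth):
    """Approximate optical-flow magnitude (px/frame) under lateral camera motion."""
    return f_px * cam_speed / depth

f_px = 800.0       # focal length in pixels (assumed)
cam_speed = 0.5    # camera translation per frame, in metres (assumed)

glass_depth = 2.0                 # camera-to-glass distance (assumed)
object_behind = 3.0               # extra distance from glass to the reflected object (assumed)
reflection_depth = glass_depth + object_behind  # virtual depth of the reflection

flow_at_glass_plane = flow_magnitude(f_px, cam_speed, glass_depth)
flow_of_reflection = flow_magnitude(f_px, cam_speed, reflection_depth)

# The reflection's flow is smaller; this inconsistency within the same
# image region is the cue that hints a glass surface is present.
print(flow_at_glass_plane)  # 200.0
print(flow_of_reflection)   # 80.0
```

In practice MVGD-Net does not use this closed-form model; it learns the cue from estimated optical flow maps, but the toy numbers show why the flow on glass regions is systematically smaller.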