🤖 AI Summary
This work addresses the performance bottleneck in weakly supervised semantic segmentation caused by reliance on sparse annotations. The authors propose a novel feed-forward 3D scene reconstruction–assisted supervision framework that leverages geometric structure recovered from 2D video sequences to propagate sparse labels across entire images. For the first time, feed-forward 3D reconstruction is integrated with weakly supervised 2D segmentation through a dual student–teacher architecture that enforces cross-modal semantic consistency between 2D and 3D representations. The method achieves state-of-the-art performance under sparse supervision without requiring additional annotations or incurring extra inference overhead, outperforming existing approaches by 2–7% in segmentation accuracy.
📝 Abstract
We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Sparse annotations alleviate this burden and offer an efficient weakly supervised alternative, but they still incur a performance gap. To close this gap, we introduce a novel approach that uses 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues for propagating sparse annotations across entire scenes. Specifically, a dual student–teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance under sparse supervision, outperforming existing approaches by 2–7% without requiring additional labels or extra inference overhead.
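To make the cross-modal consistency idea concrete, here is a minimal sketch (not the authors' implementation; the function name, shapes, and choice of a KL-divergence objective are assumptions) of a loss that aligns a 2D student's per-pixel class probabilities with a 3D teacher's per-point predictions, linked by the pixel each reconstructed point projects to:

```python
# Illustrative sketch of a 2D–3D cross-modal consistency loss.
# All names and shapes are assumptions, not the paper's API.
import numpy as np

def cross_modal_consistency(probs_2d, probs_3d, pix_idx, eps=1e-8):
    """Mean KL(teacher_3d || student_2d) over projected points.

    probs_2d: (H*W, C) student class probabilities per pixel (flattened)
    probs_3d: (N, C)   teacher class probabilities per 3D point
    pix_idx:  (N,)     index of the pixel each 3D point projects onto
    """
    p2d = probs_2d[pix_idx]  # gather student predictions at projected pixels
    kl = np.sum(probs_3d * (np.log(probs_3d + eps) - np.log(p2d + eps)), axis=1)
    return float(np.mean(kl))

# Toy example: 2 pixels, 3 classes, 2 points projecting onto them.
probs_2d = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])
probs_3d = np.array([[0.7, 0.2, 0.1],   # agrees with pixel 0
                     [0.1, 0.8, 0.1]])  # agrees with pixel 1
pix_idx = np.array([0, 1])
loss = cross_modal_consistency(probs_2d, probs_3d, pix_idx)  # ~0 when views agree
```

When the two modalities agree, the loss vanishes; when the 3D teacher confidently disagrees with the 2D student at a projected pixel, the loss grows, pushing the student toward the geometry-derived labels.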