🤖 AI Summary
To address the scarcity of pixel-level annotations in semantic segmentation, this paper pioneers the integration of reinforcement learning (RL) reward mechanisms into purely segmentation tasks, enabling dual-granularity weakly supervised training with both image-level and pixel-level supervision. We propose the Reward-driven Semantic Segmentation (RSS) framework, whose core innovations are: (1) Progressive Scale Reward (PSR), which enhances the guidance capability of image-level feedback via multi-scale policy gradient optimization; and (2) Pairwise Spatial Discrepancy (PSD) modeling, which explicitly enforces spatial consistency in predictions. By unifying RL, weakly supervised optimization, and structured spatial modeling, RSS achieves stable convergence on benchmarks such as PASCAL VOC using only image-level rewards—outperforming state-of-the-art weakly supervised methods. This demonstrates the effectiveness and advancement of the “reward-driven segmentation” paradigm.
📝 Abstract
In real-world scenarios, pixel-level labeling is not always available. Sometimes, we need a semantic segmentation network, and even a visual encoder can have a high compatibility, and can be trained using various types of feedback beyond traditional labels, such as feedback that indicates the quality of the parsing results. To tackle this issue, we proposed RSS (Reward in Semantic Segmentation), the first practical application of reward-based reinforcement learning on pure semantic segmentation offered in two granular levels (pixel-level and image-level). RSS incorporates various novel technologies, such as progressive scale rewards (PSR) and pair-wise spatial difference (PSD), to ensure that the reward facilitates the convergence of the semantic segmentation network, especially under image-level rewards. Experiments and visualizations on benchmark datasets demonstrate that the proposed RSS can successfully ensure the convergence of the semantic segmentation network on two levels of rewards. Additionally, the RSS, which utilizes an image-level reward, outperforms existing weakly supervised methods that also rely solely on image-level signals during training.