🤖 AI Summary
This paper targets two key bottlenecks in weakly supervised camouflaged object detection (WSCOD), unreliable pseudo-labels and scribble annotation bias, with a two-stage framework. In Stage I, a multi-agent debate mechanism and adaptive entropy-driven point sampling make SAM-generated pseudo-masks more task-specific and reliable. In Stage II, a frequency-aware progressive debiasing network (FADeNet) uses DCT/DWT decomposition, multi-level frequency feature fusion, and dynamic region re-weighted supervision to jointly model global structures and local details while explicitly correcting scribble bias. This work is the first to integrate multi-agent debate-based pseudo-label generation and frequency-domain debiasing into WSCOD. Evaluated on CAMO, COD10K, and NC4K, it achieves an mIoU of 62.3%, outperforming existing weakly supervised methods and narrowing the gap with fully supervised SOTA to within 3.5%.
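The summary names DCT/DWT decomposition without giving details, so the following is only a minimal sketch of the general idea, not FADeNet itself. It assumes a 1-D signal (one image row, say) and an orthonormal DCT-II, and splits the signal into a low-frequency part (coarse structure) and a high-frequency part (fine detail). The function names and the `cutoff` parameter are hypothetical and chosen for illustration.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(x))
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def split_frequency(x, cutoff):
    """Assumed split: coefficients below `cutoff` form the low band,
    the rest form the high band. By linearity, low + high == x."""
    coeffs = dct(x)
    low = idct([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = idct([c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)])
    return low, high


if __name__ == "__main__":
    row = [0.0, 1.0, 0.0, 1.0, 3.0]
    low, high = split_frequency(row, cutoff=2)
    print("low :", [round(v, 3) for v in low])
    print("high:", [round(v, 3) for v in high])
```

For images, the same transform would typically be applied along both axes; a library routine such as a 2-D DCT or a wavelet transform would replace these loops in practice.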
📝 Abstract
Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones because of two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the inherent annotation bias in scribbles is neglected, which prevents the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose ${D}^{3}$ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, ${D}^{3}$ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.
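The abstract does not say how the adaptive entropy-driven point sampling works, so the sketch below is one plausible reading rather than the authors' method. It assumes a coarse foreground-probability map is available, scores each pixel by its binary entropy, lets the number of points grow with the mean entropy (the "adaptive" part, as assumed here), and returns the most uncertain pixels as candidate point prompts. All names and parameters are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Shannon entropy in bits of a Bernoulli(p) foreground probability."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def sample_points(prob_map, min_points=1, max_points=8):
    """Assumed rule: pick the k highest-entropy pixels, where k scales
    linearly with the mean entropy of the map (which lies in [0, 1])."""
    scored = [
        (binary_entropy(p), (r, c))
        for r, row in enumerate(prob_map)
        for c, p in enumerate(row)
    ]
    mean_entropy = sum(e for e, _ in scored) / len(scored)
    k = min_points + round(mean_entropy * (max_points - min_points))
    k = min(k, len(scored))
    # Sort by descending entropy; ties keep row-major order (stable sort).
    scored.sort(key=lambda item: -item[0])
    return [coord for _, coord in scored[:k]]


if __name__ == "__main__":
    # Toy 2x2 map: two confident pixels, two uncertain ones.
    prob_map = [[0.0, 0.5],
                [1.0, 0.9]]
    print(sample_points(prob_map, min_points=1, max_points=3))
```

The returned `(row, col)` coordinates could then be passed as point prompts to a promptable segmenter such as SAM; whether the paper samples high-entropy or low-entropy regions, or samples stochastically, is not stated in the abstract.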