🤖 AI Summary
This work addresses the limited robustness of promptable segmentation models to real-world, user-provided bounding box prompts: segmentation quality is highly sensitive to natural variation in how users draw a box. To evaluate this vulnerability systematically, we introduce BREPS, a method that formulates robustness testing as a white-box optimization problem over the bounding box prompt space and generates adversarial bounding boxes that minimize or maximize segmentation error while satisfying naturalness constraints. Drawing on real bounding box annotations collected in a user study, we benchmark current models across ten diverse datasets, ranging from everyday scenes to medical imaging, and show that BREPS exposes their fragility under realistic prompting conditions.
📝 Abstract
Promptable segmentation models such as SAM have established a powerful paradigm, enabling strong generalization to unseen objects and domains with minimal user input, including points, bounding boxes, and text prompts. Among these, bounding boxes stand out as particularly effective, often outperforming points while significantly reducing annotation costs. However, current training and evaluation protocols typically rely on synthetic prompts generated through simple heuristics, offering limited insight into real-world robustness. In this paper, we investigate the robustness of promptable segmentation models to natural variations in bounding box prompts. First, we conduct a controlled user study and collect thousands of real bounding box annotations. Our analysis reveals substantial variability in segmentation quality across users for the same model and instance, indicating that SAM-like models are highly sensitive to natural prompt noise. Second, since exhaustive testing of all possible user inputs is computationally prohibitive, we reformulate robustness evaluation as a white-box optimization problem over the bounding box prompt space. We introduce BREPS, a method for generating adversarial bounding boxes that minimize or maximize segmentation error while adhering to naturalness constraints. Finally, we benchmark state-of-the-art models across 10 datasets, ranging from everyday scenes to medical imaging. Code: https://github.com/emb-ai/BREPS
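
To make the optimization view concrete, the following is a minimal sketch of a constrained search over box coordinates. It is not taken from the BREPS repository, and every name in it is hypothetical. It rests on several loud assumptions: `toy_segmenter` is a stand-in that simply fills the prompted box with a soft mask (a real promptable model would go here), the naturalness constraint is approximated by keeping each box edge within `delta` pixels of a user-drawn box, and central finite differences replace the backpropagated gradients that a white-box setting would provide.

```python
import math

SIZE = 32  # the toy image is SIZE x SIZE pixels


def sigmoid(t):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + math.tanh(0.5 * t))


def toy_segmenter(box):
    """Hypothetical stand-in for a promptable model: a soft mask filling the box."""
    x0, y0, x1, y1 = box
    cols = [sigmoid(x + 0.5 - x0) * sigmoid(x1 - x - 0.5) for x in range(SIZE)]
    rows = [sigmoid(y + 0.5 - y0) * sigmoid(y1 - y - 0.5) for y in range(SIZE)]
    return [[r * c for c in cols] for r in rows]


def soft_iou(pred, gt):
    # Soft intersection over union between a soft prediction and a binary mask.
    inter = sum(p * g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    total = sum(p + g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / (total - inter)


def quality(box, gt):
    return soft_iou(toy_segmenter(box), gt)


def numeric_grad(box, gt, eps=1e-3):
    # Central finite differences; an assumption standing in for autodiff gradients.
    grad = []
    for i in range(4):
        hi, lo = list(box), list(box)
        hi[i] += eps
        lo[i] -= eps
        grad.append((quality(hi, gt) - quality(lo, gt)) / (2 * eps))
    return grad


def search_box(user_box, gt, delta, direction, steps=60, lr=5.0):
    """direction=-1 searches for a worst-case box, direction=+1 for a best-case box."""
    box = list(user_box)
    best_box, best_q = list(box), quality(box, gt)
    for _ in range(steps):
        grad = numeric_grad(box, gt)
        box = [b + direction * lr * g for b, g in zip(box, grad)]
        # Projection step: assumed naturalness constraint of +/- delta pixels per edge.
        box = [min(max(b, u - delta), u + delta) for b, u in zip(box, user_box)]
        q = quality(box, gt)
        if direction * q > direction * best_q:
            best_box, best_q = list(box), q
    return best_box, best_q


# Ground truth: a rectangle covering columns 8..23 and rows 10..21.
gt = [[1.0 if 8 <= x < 24 and 10 <= y < 22 else 0.0 for x in range(SIZE)]
      for y in range(SIZE)]
user_box = [7.0, 9.0, 25.0, 23.0]  # a slightly loose, hypothetical user box

ref_q = quality(user_box, gt)
worst_box, worst_q = search_box(user_box, gt, delta=3.0, direction=-1)
best_box, best_q = search_box(user_box, gt, delta=3.0, direction=+1)
print(f"user box IoU {ref_q:.3f}, worst case {worst_q:.3f}, best case {best_q:.3f}")
```

The gap between the worst-case and best-case scores gives a per-instance robustness interval: a wide gap means the output depends heavily on where a plausible user happens to place the box edges. Swapping in a real model and a constraint fitted to user-study data would change the numbers but not the structure of the search.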