🤖 AI Summary
Visual grouping tasks—such as instance segmentation, object detection, and referring expression grounding—suffer from high annotation costs, coverage bias, and poor generalization due to reliance on scarce real-world labels; existing synthetic data generation methods lack flexibility, geometric fidelity, and compositional diversity. To address this, we propose SOS: an object-centric, scalable synthetic data generation framework that integrates structured layout prior modeling, generative relighting, and high-fidelity mask embedding to enable controllable synthesis of fine-grained, diverse instance masks, bounding boxes, and referring expressions. SOS is the first method to support targeted synthesis of challenging cases—e.g., intra-class referring—and significantly improves generalization under few-shot and closed-vocabulary settings. Experiments show substantial gains: +10.9 AP on LVIS detection, +8.4 $N_{ ext{Acc}}$ on gRefCOCO grounding accuracy; with only 1% COCO real data, SOS achieves +6.59 AP and +3.83 $AP_{ ext{rare}}$ for rare-class segmentation.
📝 Abstract
Visual grouping -- operationalized via instance segmentation, visual grounding, and object detection -- underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity.
We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 $N_{ ext{Acc}}$ on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 $AP_{ ext{rare}}$ on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.