🤖 AI Summary
Vision models often underperform in rare scenarios because they rely heavily on common contextual patterns. This work proposes DecoupleGen, a method that personalizes text-to-image diffusion models to decouple object content from contextual priors. While keeping generations close to the original dataset distribution, DecoupleGen synthesizes semantically plausible and diverse images of objects in rare contexts. A verification step then checks the synthesized data for relevance. Experiments on complex scene datasets show that DecoupleGen consistently improves object classification and recognition over previous approaches. The study also analyzes the factors underlying these improvements.
📝 Abstract
Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.
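The abstract does not say how the verification constraints are implemented. As a rough illustration only, one plausible form is a label-consistency filter: a generated image is kept only if a scoring function ranks its intended object class first with enough confidence. Everything in the sketch below (`GeneratedSample`, `verify_samples`, `score_fn`, the 0.5 threshold, and the stub scores) is a hypothetical stand-in and is not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GeneratedSample:
    """A synthetic image reference plus the object label it was generated for (hypothetical type)."""
    image_id: str
    target_label: str
    context: str


def verify_samples(
    samples: List[GeneratedSample],
    score_fn: Callable[[GeneratedSample], Dict[str, float]],
    threshold: float = 0.5,  # assumed cutoff, not a value from the paper
) -> List[GeneratedSample]:
    """Keep samples whose top-scoring label matches the target with score >= threshold."""
    kept = []
    for sample in samples:
        scores = score_fn(sample)
        if not scores:
            continue  # no prediction available, so the sample cannot be verified
        top_label = max(scores, key=scores.get)
        if top_label == sample.target_label and scores[top_label] >= threshold:
            kept.append(sample)
    return kept


# Stub scores standing in for a real classifier or image-text model (assumption).
STUB_SCORES = {
    "img_0": {"beach ball": 0.91, "balloon": 0.09},  # correct and confident
    "img_1": {"beach ball": 0.42, "balloon": 0.58},  # object drifted to another class
    "img_2": {"beach ball": 0.48, "balloon": 0.30},  # correct but below the threshold
}

candidates = [
    GeneratedSample("img_0", "beach ball", "road"),
    GeneratedSample("img_1", "beach ball", "road"),
    GeneratedSample("img_2", "beach ball", "parking lot"),
]

verified = verify_samples(candidates, lambda s: STUB_SCORES[s.image_id])
print([s.image_id for s in verified])
```

In practice `score_fn` would wrap whatever pretrained model is used for checking, and only the verified samples would be added to the training set as augmentation.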