🤖 AI Summary
This work reformulates object detection as an image generation task, addressing the longstanding incompatibility between traditional detection methods and generative models. Building on Stable Diffusion, the authors introduce the first end-to-end conditional generative architecture that directly synthesizes colored bounding boxes, encoding class semantics and spatial coordinates, in the original image space. Precise control over box generation is achieved through semantic constraints imposed in the latent space, bridging the gap between the generative and discriminative paradigms in computer vision. The method preserves the inherent flexibility of generative models while achieving detection accuracy comparable to state-of-the-art discriminative detectors.
📝 Abstract
This paper presents GenDet, a novel framework that redefines object detection as an image generation task. In contrast to traditional discriminative approaches, GenDet leverages generative modeling: it conditions on the input image and directly generates bounding boxes with semantic annotations in the original image space. GenDet builds a conditional generation architecture on the large-scale pre-trained Stable Diffusion model, formulating the detection task as semantic constraints within the latent space. This enables precise control over bounding box positions and category attributes while preserving the flexibility of the generative model. The methodology effectively bridges the gap between generative models and discriminative tasks, offering a fresh perspective for constructing unified visual understanding systems. Systematic experiments demonstrate that GenDet achieves accuracy competitive with discriminative detectors while retaining the flexibility characteristic of generative methods.
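The abstract does not specify how detections are represented as pixels, so the following is only an illustrative sketch of the general idea of "boxes as an image": each class is mapped to a color, boxes are painted as filled rectangles into an image-space target that a generative model could synthesize, and boxes are recovered afterwards by color matching. The color table and function names are hypothetical, not from the paper.

```python
import numpy as np

# Hypothetical class -> RGB color table (illustrative, not from the paper).
CLASS_COLORS = {0: (255, 0, 0), 1: (0, 255, 0), 2: (0, 0, 255)}

def encode_boxes(boxes, labels, height, width):
    """Render detections as filled colored rectangles in image space.

    boxes: list of (x0, y0, x1, y1) pixel coordinates; labels: class ids.
    Returns an (H, W, 3) uint8 target image a generative model could be
    trained to synthesize, conditioned on the input photo.
    """
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for (x0, y0, x1, y1), cls in zip(boxes, labels):
        canvas[y0:y1, x0:x1] = CLASS_COLORS[cls]
    return canvas

def decode_boxes(canvas):
    """Recover (box, class_id) pairs from a generated box image by color match."""
    results = []
    for cls, color in CLASS_COLORS.items():
        mask = np.all(canvas == np.array(color, dtype=np.uint8), axis=-1)
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue
        # Tight bounding rectangle of all pixels carrying this class color.
        results.append(((int(xs.min()), int(ys.min()),
                         int(xs.max()) + 1, int(ys.max()) + 1), cls))
    return results
```

Note that this naive decoder merges overlapping same-class boxes into one rectangle; resolving such ambiguities is presumably where the paper's latent-space semantic constraints come in, which this sketch does not model.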