🤖 AI Summary
Conventional diffusion models typically require dozens of denoising steps, which makes them computationally costly for modeling noisy bounding boxes. To address this, the paper integrates Consistency Models (CMs) into object detection and proposes a “few-step denoising” paradigm. Detection is formulated as a rapid denoising process applied to noisy bounding boxes: conditional consistency learning recovers accurate boxes from random initializations in only 2–4 steps. The method comprises (i) diffusion modeling in bounding-box space, (ii) controllable noise injection, (iii) conditional denoising training, and (iv) a self-consistent iterative refinement mechanism. Evaluated on MS-COCO and LVIS, the approach is reported to outperform state-of-the-art detectors in both accuracy and efficiency, with 3–5× faster inference while preserving or improving detection performance. The code is publicly available.
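The noise-injection step (ii) can be illustrated with a minimal sketch. It assumes the additive Gaussian perturbation commonly used with consistency models (noisy box = clean box + sigma × noise) and boxes stored as normalized (cx, cy, w, h) values in [0, 1]. The function name `add_box_noise` and the `scale` parameter are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def add_box_noise(gt_boxes, sigma, rng, scale=2.0):
    """Perturb ground-truth boxes with Gaussian noise of standard deviation sigma.

    gt_boxes: array of shape (N, 4), normalized (cx, cy, w, h) in [0, 1] (assumed format).
    scale:    hypothetical signal scale; boxes are mapped to [-scale, scale] before noising.
    """
    gt_boxes = np.asarray(gt_boxes, dtype=np.float64)
    # Map coordinates from [0, 1] to [-scale, scale] so signal and noise share a range.
    x0 = (gt_boxes * 2.0 - 1.0) * scale
    # Additive perturbation: larger sigma means a noisier starting point for denoising.
    noise = rng.standard_normal(x0.shape)
    return x0 + sigma * noise

rng = np.random.default_rng(0)
gt = np.array([[0.5, 0.5, 0.2, 0.3]])
noisy = add_box_noise(gt, sigma=1.0, rng=rng)
```

With `sigma=0.0` the function returns only the rescaled clean boxes, which makes the effect of the noise level easy to inspect.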
📝 Abstract
Object detection, a quintessential task in the realm of perceptual computing, can be tackled using a generative methodology. In the present study, we introduce a novel framework designed to articulate object detection as a denoising diffusion process, which operates on the perturbed bounding boxes of annotated entities. This framework, termed ConsistencyDet, leverages an innovative denoising concept known as the Consistency Model. The hallmark of this model is its self-consistency feature, which empowers the model to map distorted information from any time step back to its pristine state, thereby realizing a "few-step denoising" mechanism. Such an attribute markedly elevates the operational efficiency of the model, setting it apart from the conventional Diffusion Model. Throughout the training phase, ConsistencyDet initiates the diffusion sequence with noise-infused boxes derived from the ground-truth annotations and conditions the model to perform the denoising task. Subsequently, in the inference stage, the model employs a denoising sampling strategy that commences with bounding boxes randomly sampled from a normal distribution. Through iterative refinement, the model transforms an assortment of arbitrarily generated boxes into definitive detections. Comprehensive evaluations employing standard benchmarks, such as MS-COCO and LVIS, corroborate that ConsistencyDet surpasses other leading-edge detectors in performance metrics. Our code is available at https://anonymous.4open.science/r/ConsistencyDet-37D5.
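The inference procedure described above, which starts from randomly sampled boxes and refines them in a few steps, can be sketched with the generic multistep consistency sampling loop. This is an illustration under stated assumptions, not the paper's implementation: the noise bounds `EPS` and `T_MAX`, the intermediate noise levels `taus`, and the toy denoiser are all hypothetical. The toy denoiser stands in for the trained detector and is the exact consistency function for data concentrated on a single set of target boxes, so it only demonstrates the control flow.

```python
import numpy as np

EPS, T_MAX = 0.002, 80.0  # assumed smallest and largest noise levels

def multistep_consistency_sample(denoise, shape, taus, rng):
    """Few-step sampling: one denoising call from pure noise, then re-noise and denoise."""
    # Start from boxes drawn from a zero-mean normal distribution with std T_MAX.
    x = rng.standard_normal(shape) * T_MAX
    x = denoise(x, T_MAX)
    for tau in taus:  # decreasing intermediate noise levels
        z = rng.standard_normal(shape)
        # Re-inject noise at level tau, then map back to a clean estimate.
        x_tau = x + np.sqrt(tau**2 - EPS**2) * z
        x = denoise(x_tau, tau)
    return x

def make_toy_denoiser(target):
    """Hypothetical stand-in for the trained network (not the paper's model)."""
    target = np.asarray(target, dtype=np.float64)
    def denoise(x, t):
        # Satisfies the boundary condition denoise(x, EPS) == x.
        return target + (EPS / t) * (x - target)
    return denoise

target = np.array([[0.5, 0.5, 0.2, 0.3], [0.1, 0.8, 0.1, 0.1]])
rng = np.random.default_rng(0)
boxes = multistep_consistency_sample(
    make_toy_denoiser(target), target.shape, taus=[20.0, 5.0], rng=rng
)
```

Here `taus=[20.0, 5.0]` gives three denoiser evaluations in total. In a real detector the denoiser would also be conditioned on image features, and the refined boxes would still need class scores and post-processing.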