🤖 AI Summary
Fine-grained object detection, such as vehicle damage assessment, is challenging because of strong contextual dependencies and insufficient local feature modeling. To address this, we propose ContextDiff, a detection framework that combines global scene understanding with generative denoising. It adopts a conditional diffusion detection paradigm and adds a dedicated global context encoder and an end-to-end generative denoising training strategy. Its core innovation is a context-aware fusion module that uses cross-attention to dynamically integrate local proposal features with independently encoded global scene representations, alleviating the overreliance of conventional conditional diffusion models on local features. Evaluated on the CarDD benchmark, ContextDiff achieves a 3.2% mAP improvement over prior state-of-the-art methods, setting a new reference point for fine-grained detection in complex, context-rich scenes.
📝 Abstract
Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, is difficult even for human experts to perform reliably. While DiffusionDet has advanced the state of the art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this limitation by introducing Context-Aware Fusion (CAF), which uses cross-attention to integrate global scene context directly with local proposal features. The global context is generated by a separate, dedicated encoder that captures comprehensive environmental information, enabling each object proposal to attend to scene-level information and thereby enhancing the generative detection paradigm. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.
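The cross-attention step described above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function name `context_aware_fusion`, the single attention head, the residual connection, the feature sizes, and the random projection matrices are all assumptions made for illustration. Proposal features act as queries, and global context tokens (which in the paper come from a separate encoder, here replaced by random placeholders) act as keys and values.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=1.0):
    """Placeholder features or weights drawn from a Gaussian (illustrative only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def context_aware_fusion(proposals, context, w_q, w_k, w_v):
    """Single-head cross-attention from proposals (N x d) to context tokens (M x d).

    Returns the fused proposal features (N x d) and the attention weights (N x M).
    The residual addition is an assumption, not something the abstract specifies.
    """
    q = matmul(proposals, w_q)   # queries from local proposal features
    k = matmul(context, w_k)     # keys from global scene context
    v = matmul(context, w_v)     # values from global scene context
    d_k = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    attended = matmul(weights, v)
    fused = [
        [p + a for p, a in zip(p_row, a_row)]
        for p_row, a_row in zip(proposals, attended)
    ]
    return fused, weights


rng = random.Random(0)
d = 8                                     # hypothetical feature dimension
proposals = random_matrix(4, d, rng)      # 4 hypothetical object proposals
context = random_matrix(6, d, rng)        # 6 hypothetical global context tokens
w_q, w_k, w_v = (random_matrix(d, d, rng, scale=0.1) for _ in range(3))

fused, weights = context_aware_fusion(proposals, context, w_q, w_k, w_v)
print(len(fused), len(fused[0]))          # one fused vector per proposal
```

Each row of `weights` is a distribution over the context tokens, so every proposal receives its own weighted summary of the scene. A practical version would use learned projections, multiple heads, and a tensor library rather than nested lists.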