PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges in open-vocabulary object detection: textual representations struggle to align with complex visual concepts, and the scarcity of image-text pairs for rare categories limits performance in specialized domains and intricate scenes. To overcome these limitations, the authors propose PET-DINO, a unified detector built upon Grounding DINO that incorporates an Alignment-Friendly Visual Prompt Generation (AFVPG) module and supports both text and visual prompts. They also introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level, which together enable simultaneous modeling of multiple prompt routes. According to the abstract, the resulting detector shows competitive zero-shot detection capabilities across various prompt-based detection protocols.
📝 Abstract
Open-Set Object Detection (OSOD) enables recognition of novel categories beyond a fixed set of classes, but it faces two challenges: aligning text representations with complex visual concepts, and the scarcity of image-text pairs for rare categories. These challenges lead to suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to the inheritance-based philosophy and the prompt-enriched training strategies, which play a critical role in building an effective generic object detector. Project page: https://fuweifuvtoo.github.io/pet-dino.
Problem

Research questions and friction points this paper is trying to address.

Open-Set Object Detection
text-visual alignment
data scarcity
prompt-based detection
training strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-Enriched Training
Visual Prompting
Open-Set Object Detection
Zero-Shot Detection
Alignment-Friendly Prompt Generation