Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method

📅 2026-03-29

🤖 AI Summary
This work addresses the significant performance degradation of open-vocabulary object detection under domain shift, which stems from the fragility of vision-language coupling. It formally introduces the task of domain-generalized open-vocabulary object detection (DG-OVOD) and shows that domain shift induces a collapse of the cross-modal semantic space. To mitigate this, the authors propose Progressive Domain-invariant Cross-modal Alignment (PICA), which strengthens cross-domain modality alignment through a multi-level curriculum. PICA builds adaptive pseudo-word prototypes and refines them with sample confidence filtering and visual consistency constraints, progressively aligning representations across domains. According to the authors, experiments show that PICA substantially improves detection robustness on out-of-distribution domains, and the work provides both a benchmark and a methodological basis for building generalizable open-vocabulary systems in real-world scenarios.

📝 Abstract
Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level ambiguity and signal strength curriculum. It builds adaptive pseudo-word prototypes, refined via sample confidence and visual consistency, to enforce invariant cross-domain modality alignment. Our findings suggest that OVOD's robustness to domain shifts is intrinsically linked to the stability of the latent cross-modal alignment space. Our work provides both a challenging benchmark and a new perspective on building truly generalizable open-vocabulary systems that extend beyond static laboratory conditions.
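
The abstract describes prototypes that are refined by sample confidence and visual consistency under a curriculum, but it gives no formulas. The following is only an illustrative sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the linear schedule that relaxes the confidence threshold over training, the cosine-similarity consistency check between a feature and its augmented view, and the moving-average prototype update.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def confidence_threshold(step, total_steps, start=0.9, end=0.5):
    """Hypothetical curriculum: strict threshold early, relaxed linearly later."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def update_prototype(prototype, samples, step, total_steps,
                     momentum=0.9, min_consistency=0.8):
    """Refine one class prototype from filtered samples.

    samples: list of (feature, augmented_feature, confidence) tuples.
    A sample is kept only if its confidence passes the current curriculum
    threshold AND its two views agree (cosine >= min_consistency).
    Returns (new_prototype, number_of_samples_kept).
    """
    tau = confidence_threshold(step, total_steps)
    kept = [f for f, f_aug, conf in samples
            if conf >= tau and cosine(f, f_aug) >= min_consistency]
    if not kept:
        # Nothing reliable at this stage: leave the prototype unchanged.
        return list(prototype), 0
    dim = len(prototype)
    mean = [sum(f[i] for f in kept) / len(kept) for i in range(dim)]
    # Moving-average update keeps the prototype stable across batches.
    new = [momentum * p + (1.0 - momentum) * m for p, m in zip(prototype, mean)]
    return new, len(kept)


if __name__ == "__main__":
    proto = [1.0, 0.0]
    batch = [
        ([0.0, 1.0], [0.0, 1.0], 0.95),   # confident and consistent
        ([1.0, 1.0], [1.0, -1.0], 0.99),  # confident but inconsistent views
        ([0.0, 1.0], [0.0, 1.0], 0.60),   # consistent but low confidence
    ]
    print(update_prototype(proto, batch, step=0, total_steps=10))
    print(update_prototype(proto, batch, step=10, total_steps=10))
```

In this toy setup, early steps admit only the sample that is both confident and consistent across views, while later steps also admit the lower-confidence sample. The direction and shape of the real curriculum in PICA may differ.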

Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Object Detection
Domain Generalization
Cross-modal Alignment
Distribution Shift
Visual-Textual Coupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-Generalized OVOD
Cross-modal Alignment
Pseudo-word Prototypes
Distribution Shift
Curriculum Learning