Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method

2026-03-29
Citations: 0
Influential: 0
AI Summary
This work addresses the significant performance degradation of open-vocabulary object detection under domain shift, which stems from the fragility of vision-language coupling. It formally introduces the task of domain-generalized open-vocabulary object detection (DG-OVOD) and shows that domain shift induces a collapse of the cross-modal semantic space. To mitigate this, the authors propose Progressive Domain-invariant Cross-modal Alignment (PICA), an approach that strengthens cross-domain modality alignment through a multi-level curriculum. PICA builds adaptive pseudo-word prototypes and refines them with sample confidence filtering and visual consistency constraints, progressively aligning representations across domains. According to the authors, experiments show that PICA improves detection robustness on out-of-distribution domains, and the work provides both a benchmark and a methodological foundation for building generalizable open-vocabulary systems in real-world scenarios.
Abstract
Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level ambiguity and signal strength curriculum. It builds adaptive pseudo-word prototypes, refined via sample confidence and visual consistency, to enforce invariant cross-domain modality alignment. Our findings suggest that OVOD's robustness to domain shifts is intrinsically linked to the stability of the latent cross-modal alignment space. Our work provides both a challenging benchmark and a new perspective on building truly generalizable open-vocabulary systems that extend beyond static laboratory conditions.
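The abstract describes prototype refinement only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes a linear curriculum that starts with a strict confidence threshold and relaxes it over training, a cosine-similarity check between a region feature and the feature of an augmented view as the visual consistency test, and a momentum (moving-average) prototype update. All function names, parameters, and default values here are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def confidence_threshold(step, total_steps, start=0.9, end=0.5):
    """Assumed curriculum: strict threshold early, more permissive later."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def update_prototype(prototype, samples, step, total_steps,
                     momentum=0.9, min_consistency=0.8):
    """Refine one class prototype from filtered samples.

    samples: list of (feature, augmented_feature, confidence) tuples.
    A sample is kept only if its confidence passes the current curriculum
    threshold AND its two views agree (cosine >= min_consistency).
    """
    tau = confidence_threshold(step, total_steps)
    kept = [f for f, f_aug, conf in samples
            if conf >= tau and cosine(f, f_aug) >= min_consistency]
    if not kept:
        # Nothing trustworthy in this batch: leave the prototype as is.
        return list(prototype)
    dim = len(prototype)
    mean = l2_normalize([sum(f[i] for f in kept) / len(kept)
                         for i in range(dim)])
    blended = [momentum * p + (1 - momentum) * m
               for p, m in zip(prototype, mean)]
    return l2_normalize(blended)


if __name__ == "__main__":
    proto = [1.0, 0.0]
    batch = [
        ([0.0, 1.0], [0.0, 1.0], 0.95),  # confident and consistent: kept
        ([0.0, 1.0], [1.0, 0.0], 0.95),  # confident but inconsistent: dropped
        ([0.0, 1.0], [0.0, 1.0], 0.10),  # consistent but low confidence: dropped
    ]
    print(update_prototype(proto, batch, step=0, total_steps=10))
```

In this sketch the prototype moves only slightly toward the kept samples at each step, so a single noisy batch cannot pull it far from its previous position. Whether PICA uses a schedule or update rule of this form is not stated in the abstract.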
Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Object Detection
Domain Generalization
Cross-modal Alignment
Distribution Shift
Visual-Textual Coupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-Generalized OVOD
Cross-modal Alignment
Pseudo-word Prototypes
Distribution Shift
Curriculum Learning
Xiaoran Xu
USF
Lung sound, LLM, Healthcare, Machine learning, Biomedical
Xiaoshan Yang
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jiangang Yang
Institute of Microelectronics, University of Chinese Academy of Sciences, Beijing, China
Yifan Xu
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jian Liu
Institute of Information Engineering, CAS
Software testing and security
Changsheng Xu
Professor, Institute of Automation, Chinese Academy of Sciences
Multimedia, Computer vision