C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the dual challenges of poor generalization to unseen categories and low robustness under adverse conditions (e.g., low illumination, occlusion) in open-world object detection, this paper proposes a curriculum-based cross-modal contrastive learning framework, the first to integrate RGB-thermal (RGBT) multimodal perception with vision-language alignment. To mitigate catastrophic forgetting between the two training stages, an exponential moving average (EMA) mechanism is adopted, with theoretical guarantees that prior knowledge is preserved. By jointly leveraging RGBT pretraining and cross-modal contrastive learning, the method simultaneously enhances category openness and environmental robustness. Extensive experiments on the FLIR, OV-COCO, and OV-LVIS benchmarks yield 80.1 AP⁵⁰, 48.6 novel-class AP⁵⁰, and 35.7 mAPᵣ, respectively, outperforming prior state-of-the-art methods.

📝 Abstract
Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages vision-language alignment strategies for category diversity but struggles in extreme environments. This trade-off makes robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose C3-OWD, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage 1 enhances robustness by pretraining with RGBT data, while Stage 2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between the two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves 80.1 AP⁵⁰ on FLIR, 48.6 novel-class AP⁵⁰ on OV-COCO, and 35.7 mAPᵣ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.
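The EMA mechanism the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation; the parameter-dictionary representation, function name, and decay value are assumptions chosen for clarity:

```python
def ema_update(teacher, student, decay=0.999):
    """One EMA step: theta_teacher <- decay*theta_teacher + (1-decay)*theta_student.

    The slowly-moving teacher retains Stage-1 (RGBT) knowledge while the
    student adapts during Stage-2 vision-language alignment; the gap between
    the two parameter sets stays bounded by the choice of decay.
    (Illustrative sketch: real models would update tensors, not floats.)
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

# Toy example: the teacher drifts only slightly toward the student each step.
teacher = {"w": 0.0}
student = {"w": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
# teacher["w"] is now approximately 0.1
```

A large decay (close to 1) trades adaptation speed for stronger preservation of pre-stage behavior, which is the knob behind the bounded-lag guarantee the abstract refers to.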
Problem

Research questions and friction points this paper is trying to address.

Improving object detection generalization to unseen categories
Enhancing detection robustness under adverse environmental conditions
Unifying robustness and diversity in open-world detection systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum cross-modal contrastive learning framework
RGBT pretraining for robustness enhancement
Vision-language alignment for generalization improvement
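As a rough illustration of the vision-language alignment ingredient listed above, cross-modal contrastive learning is commonly trained with a symmetric InfoNCE objective over paired image and text embeddings. The sketch below is a generic plain-Python version of that objective, not the paper's code; the temperature and embedding values are illustrative:

```python
import math

def info_nce(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over paired image/text embeddings.

    Each image embedding is pulled toward its paired text embedding and
    pushed away from all other texts in the batch (and vice versa).
    Inputs are lists of equal-length vectors; pairs share an index.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(a):
        n = math.sqrt(dot(a, a))
        return [x / n for x in a]

    imgs = [normalize(v) for v in img_embs]
    txts = [normalize(v) for v in txt_embs]
    # Cosine-similarity matrix, scaled by temperature.
    sims = [[dot(i, t) / temperature for t in txts] for i in imgs]
    n = len(imgs)
    loss = 0.0
    for i in range(n):
        row = sims[i]                          # image i vs. all texts
        col = [sims[j][i] for j in range(n)]   # text i vs. all images
        loss += -math.log(math.exp(row[i]) / sum(map(math.exp, row)))
        loss += -math.log(math.exp(col[i]) / sum(map(math.exp, col)))
    return loss / (2 * n)
```

With correctly matched pairs the loss approaches zero; shuffling the pairing drives it up, which is the signal that aligns the two modalities.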
Siheng Wang
Jiangsu University
Zhengdao Li
The Chinese University of Hong Kong, Shenzhen
Machine Learning on Graphs, Graph Representation Learning
Yanshu Li
Brown University
NLP, Multimodal Learning
Canran Xiao
Central South University
Haibo Zhan
Jiangsu University
Zhengtao Yao
University of Southern California
Xuzhi Zhang
University of Southern California
Jiale Kang
Yuanshi Inc.
Linshan Li
Jiangsu University
Weiming Liu
Zhejiang University
Zhikang Dong
State University of New York at Stony Brook
Jifeng Shen
Jiangsu University
Computer Vision
Junhao Dong
Nanyang Technological University
Qiang Sun
University of Toronto
Piotr Koniusz
Principal Scientist (Data61, CSIRO). Hon./Adj. Associate Professor (Level D) (ANU & UNSW).
Computer Vision, Machine Learning, Recognition, Tensor and Kernel Methods, Neural Networks