🤖 AI Summary
To address the dual challenges of poor generalization to unseen categories and low robustness under adverse conditions (e.g., low illumination, occlusion) in open-world object detection, this paper proposes a curriculum-based cross-modal contrastive learning framework, the first to integrate RGB-thermal (RGBT) multimodal perception with vision-language alignment. To mitigate catastrophic forgetting between the two training stages, an exponential moving average (EMA) mechanism is adopted, providing theoretical guarantees for preserving prior knowledge. By jointly leveraging RGBT pretraining and cross-modal contrastive learning, the method simultaneously enhances category openness and environmental robustness. Extensive experiments on the FLIR, OV-COCO, and OV-LVIS benchmarks yield 80.1 AP⁵⁰, 48.6 AP⁵⁰ₙₒᵥₑₗ, and 35.7 mAPᵣ, respectively, establishing competitive performance against state-of-the-art methods across both robustness and diversity evaluations.
📝 Abstract
Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages vision-language alignment for category diversity but struggles in extreme environments. This trade-off makes robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose **C3-OWD**, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage 1 enhances robustness by pretraining with RGBT data, while Stage 2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between the two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves $80.1$ AP$^{50}$ on FLIR, $48.6$ AP$^{50}_{\text{Novel}}$ on OV-COCO, and $35.7$ mAP$_r$ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.
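The EMA mechanism referenced above can be sketched in a few lines. This is a minimal illustration of the standard EMA parameter update, not the authors' implementation: `ema_update`, the dict-of-floats parameter representation, and the `decay` value are all illustrative assumptions (real detectors update tensors in-place with `torch.no_grad()`).

```python
def ema_update(ema_params, student_params, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * student.

    `ema_params` / `student_params` are hypothetical name -> float dicts
    standing in for model parameters; `decay` close to 1 keeps the EMA
    copy lagging smoothly behind the student, which is what bounds the
    parameter drift between training stages.
    """
    return {
        name: decay * ema_params[name] + (1.0 - decay) * student_params[name]
        for name in ema_params
    }


# Toy illustration of the bounded lag: with a fixed student, the EMA
# copy converges geometrically toward it at rate (1 - decay) per step.
ema = {"w": 0.0}
student = {"w": 1.0}
for _ in range(3):
    ema = ema_update(ema, student, decay=0.5)
# ema["w"] moves 0.0 -> 0.5 -> 0.75 -> 0.875
```

In the two-stage setting described in the abstract, the Stage-1 weights would seed the EMA copy, so Stage-2 updates can only move the averaged parameters by a bounded amount per step.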