Cross-domain Few-shot Object Detection with Multi-modal Textual Enrichment

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant performance degradation in cross-domain few-shot object detection caused by domain shift, this paper proposes a text-semantics-enhanced meta-learning approach. The method jointly optimizes domain adaptation and few-shot generalization within a unified meta-learning framework. Its key contributions are: (1) a novel bidirectional text-feature-guided semantic correction mechanism that dynamically rectifies semantic discrepancies in visual features; and (2) a vision-language aligned multimodal aggregation module enabling fine-grained fusion of cross-modal representations. Evaluated on mainstream cross-domain few-shot detection benchmarks, the approach achieves an 8.2% improvement in mean Average Precision (mAP), demonstrating substantial gains in model robustness and cross-domain generalization capability.

📝 Abstract
Advancements in cross-modal feature extraction and integration have significantly enhanced performance in few-shot learning tasks. However, current multi-modal object detection (MM-OD) methods often experience notable performance degradation when encountering substantial domain shifts. We propose that incorporating rich textual information can enable the model to establish a more robust knowledge relationship between visual instances and their corresponding language descriptions, thereby mitigating the challenges of domain shift. Specifically, we focus on the problem of Cross-Domain Multi-Modal Few-Shot Object Detection (CDMM-FSOD) and introduce a meta-learning-based framework designed to leverage rich textual semantics as an auxiliary modality to achieve effective domain adaptation. Our new architecture incorporates two key components: (i) A multi-modal feature aggregation module, which aligns visual and linguistic feature embeddings to ensure cohesive integration across modalities. (ii) A rich text semantic rectification module, which employs bidirectional text feature generation to refine multi-modal feature alignment, thereby enhancing understanding of language and its application in object detection. We evaluate the proposed method on common cross-domain object detection benchmarks and demonstrate that it significantly surpasses existing few-shot object detection approaches.
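The abstract describes the multi-modal feature aggregation module only at a high level. As a minimal NumPy sketch of one plausible realization (scaled dot-product cross-attention with visual queries over text keys and values, plus residual fusion), where the function name, dimensions, and fusion scheme are all assumptions rather than the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_aggregate(visual, text, d_k=None):
    """Fuse text embeddings into visual features via scaled
    dot-product cross-attention (visual queries, text keys/values).

    visual: (N, D) visual region features
    text:   (M, D) text token embeddings
    Returns (N, D) text-enriched visual features.
    """
    if d_k is None:
        d_k = visual.shape[-1]
    attn = softmax(visual @ text.T / np.sqrt(d_k))  # (N, M) affinities
    aggregated = attn @ text                        # (N, D) text summary per region
    return visual + aggregated                      # residual fusion

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))   # 4 region proposals
t = rng.standard_normal((6, 8))   # 6 text tokens
out = cross_modal_aggregate(v, t)
print(out.shape)  # (4, 8)
```

In a trained detector the queries, keys, and values would pass through learned projections; this sketch omits them to isolate the alignment-and-aggregation step.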
Problem

Research questions and friction points this paper is trying to address.

Cross-domain few-shot object detection
Multi-modal textual enrichment
Domain shift mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal feature aggregation module
Rich text semantic rectification module
Meta-learning-based framework
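The rich text semantic rectification module is described only as using bidirectional text feature generation to refine alignment. One hypothetical reading, sketched below in NumPy, is a round trip through the text space: project visual features into text space, soft-assign them to text embeddings, map the consensus back, and blend it with the original features. The projection matrices, soft-assignment, and blending factor are all assumptions for illustration, not the paper's mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
D = 8
# Hypothetical projections between visual and text spaces
# (random stand-ins here; in practice these would be learned).
W_v2t = rng.standard_normal((D, D)) * 0.1
W_t2v = rng.standard_normal((D, D)) * 0.1

def rectify(visual, text, alpha=0.5):
    """Sketch of text-guided semantic correction: project visual
    features into text space, soft-assign them to text embeddings,
    map the result back, and blend with the original features."""
    pseudo_text = visual @ W_v2t             # visual -> text space
    weights = softmax(pseudo_text @ text.T)  # affinity to text tokens
    snapped = weights @ text                 # text-semantic consensus
    corrected = snapped @ W_t2v              # back to visual space
    return (1 - alpha) * visual + alpha * corrected

v = rng.standard_normal((4, D))  # 4 region features
t = rng.standard_normal((6, D))  # 6 text embeddings
out = rectify(v, t)
print(out.shape)  # (4, 8)
```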
Zeyu Shangguan
University of Southern California
Artificial Intelligence, Robotics

Daniel Seita
University of Southern California
Robotics, Machine Learning

Mohammad Rostami
Department of Computer Science, University of Southern California, Los Angeles, CA, USA