🤖 AI Summary
In medical object detection, jointly training a single detector on multimodal images (e.g., X-ray, CT, MRI) degrades performance due to inter-modal statistical heterogeneity and disjoint query representation spaces. To address this, we align DETR-style object queries with modality context without modifying the detector architecture. Our approach comprises three key components: (1) lightweight modality tokens derived from textual descriptions of the imaging modality; (2) a Multimodality Context Attention (MoCA) mechanism that injects modality context into object queries via self-attention; and (3) QueryREPA, a contrastive pre-alignment stage that aligns query representations with their modality tokens. Under multimodal joint training, the approach consistently improves mean Average Precision (mAP), incurs negligible computational overhead, requires no additional annotations, and enhances model generalization and cross-modal consistency.
📝 Abstract
Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding the imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Trained jointly across diverse modalities, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodal medical object detection. Project page: https://araseo.github.io/alignyourquery/.
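The abstract describes two mechanisms without implementation detail: MoCA appends a modality token to the object-query set and lets self-attention propagate modality context, and QueryREPA pulls each image's query representation toward its own modality token with a contrastive objective over a modality-balanced batch. The following is only an illustrative sketch under assumed shapes, using a plain single-head attention and a standard InfoNCE-style loss as stand-ins; the paper's actual MoCA layer, pooling, and task-specific objective may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32          # assumed query / token embedding dimension
n_queries = 8   # assumed number of object queries per image

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moca(queries, modality_token):
    """Sketch of MoCA: single-head self-attention over [queries; token].

    The text-derived modality token is appended to the query set, attention
    mixes modality context into every query, and the token is dropped again,
    so the detector's query count and architecture are unchanged.
    """
    x = np.vstack([queries, modality_token[None, :]])   # (n+1, d)
    attn = softmax(x @ x.T / np.sqrt(d))                # (n+1, n+1)
    mixed = attn @ x
    return mixed[:-1]                                   # back to (n, d)

def contrastive_prealign_loss(query_reprs, modality_tokens, temperature=0.1):
    """InfoNCE-style stand-in for QueryREPA's objective: each pooled query
    representation should match its own modality token against the other
    modalities present in the (modality-balanced) batch."""
    q = query_reprs / np.linalg.norm(query_reprs, axis=1, keepdims=True)
    t = modality_tokens / np.linalg.norm(modality_tokens, axis=1, keepdims=True)
    logits = q @ t.T / temperature                      # (B, B) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # positives on diagonal

# Toy modality-balanced batch: one CXR, one CT, one MRI sample.
tokens = rng.normal(size=(3, d))             # stand-ins for text-derived tokens
queries = rng.normal(size=(3, n_queries, d)) # stand-ins for object queries
mixed = np.stack([moca(q, t) for q, t in zip(queries, tokens)])
pooled = mixed.mean(axis=1)                  # assumed per-image query pooling
loss = contrastive_prealign_loss(pooled, tokens)
```

The sketch keeps the query set's size fixed (the modality token is discarded after mixing), which matches the claim that the method preserves DETR-style architectures and adds negligible latency.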