Align Your Query: Representation Alignment for Multimodality Medical Object Detection

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In medical object detection, jointly training a single detector on multiple imaging modalities (e.g., X-ray, CT, MRI) degrades performance due to inter-modal statistical heterogeneity and disjoint query representation spaces. To address this, the paper proposes a simple, detector-agnostic framework that aligns DETR-style object queries with modality context, built from three components: (1) lightweight modality tokens derived from textual descriptions of the imaging modality, requiring no extra annotations; (2) a Multimodality Context Attention (MoCA) mechanism that propagates modality context through the query set via self-attention; and (3) QueryREPA, a short contrastive pretraining stage that aligns query representations to their modality tokens using modality-balanced batches. The approach consistently improves average precision (AP) under multimodal joint training, incurs negligible computational overhead, and strengthens model generalization and cross-modal consistency.

📝 Abstract
Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained altogether, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.
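The abstract describes MoCA as mixing a text-derived modality token into the object-query set with self-attention so that modality context spreads to every query, without changing the DETR-style architecture. A minimal numpy sketch of that idea, assuming a single attention head with no learned Q/K/V projections (the function name `moca_mix` and all dimensions are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moca_mix(object_queries, modality_token):
    """Hypothetical MoCA sketch: append the modality token to the query
    set, run one self-attention pass so modality context propagates to
    every query, then drop the token again."""
    x = np.vstack([object_queries, modality_token[None, :]])  # (N+1, d)
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))  # scaled dot-product weights
    mixed = attn @ x                      # context-mixed representations
    return mixed[:-1]                     # modality-aware object queries

rng = np.random.default_rng(0)
queries = rng.standard_normal((4, 8))   # 4 object queries, dim 8
token = rng.standard_normal(8)          # text-derived token (e.g., "CT")
out = moca_mix(queries, token)
print(out.shape)  # (4, 8): same query set, now carrying modality cues
```

Because the token is appended only for the mixing step and removed afterward, the downstream detection head still sees the usual number of queries, which is consistent with the abstract's claim of negligible latency and no architectural modification.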
Problem

Research questions and friction points this paper is trying to address.

Aligning heterogeneous medical imaging modalities for unified object detection
Addressing representation space disparities in multimodality medical data
Improving detector performance across mixed CXR, CT, and MRI modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns object queries with modality context
Uses modality tokens for imaging modality encoding
Integrates Multimodality Context Attention and QueryREPA
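The QueryREPA pretraining stage is described as aligning query representations to their modality tokens with a task-specific contrastive objective over modality-balanced batches. As an illustrative stand-in (not the paper's exact objective), an InfoNCE-style loss that pulls each pooled query representation toward its own modality token and pushes it from the others can be sketched as follows; the function name `query_modality_nce` and the temperature `tau` are assumptions:

```python
import numpy as np

def query_modality_nce(query_reps, modality_tokens, labels, tau=0.1):
    """InfoNCE-style sketch of query-to-modality-token alignment:
    each row of query_reps is attracted to modality_tokens[labels[i]]
    and repelled from the other modality tokens."""
    q = query_reps / np.linalg.norm(query_reps, axis=1, keepdims=True)
    t = modality_tokens / np.linalg.norm(modality_tokens, axis=1, keepdims=True)
    logits = q @ t.T / tau                                   # (B, M) similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(1)
tokens = rng.standard_normal((3, 8))     # e.g., CXR / CT / MRI modality tokens
labels = np.array([0, 1, 2, 0, 1, 2])    # modality-balanced batch
aligned = tokens[labels] + 0.05 * rng.standard_normal((6, 8))
shuffled = tokens[(labels + 1) % 3]      # every query paired with the wrong modality
print(query_modality_nce(aligned, tokens, labels))
print(query_modality_nce(shuffled, tokens, labels))
```

The modality-balanced batch (equal counts per modality) keeps every modality represented as both positive and negative in each step, which matches the abstract's stated batching strategy; well-aligned representations score a lower loss than mismatched ones.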