Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection

📅 2025-05-28
🤖 AI Summary
Small-object detection (SOD) suffers from sparse feature representations, insufficient contextual modeling, and low classification confidence. To address these challenges, this paper proposes a hybrid architecture that integrates deep MLPs and Transformers. First, it designs the Cross Coding Twice Module (CCTM), which enables bidirectional refinement between MLP and Transformer features to enhance fine-grained small-object representations. Second, it introduces Category-Size soft labels and a Boost Loss to jointly optimize scale-aware classification. The method achieves 36.4% APs on small objects in COCO, outperforming DINO by 4.4 percentage points, while requiring only 45M parameters and 12 training epochs, significantly surpassing existing DETR-based approaches. The core contributions are: (i) the CCTM mechanism for cross-stream feature interaction between the MLP and Transformer branches, and (ii) a scale-adaptive classification strategy built on soft-label regularization and loss reweighting.

📝 Abstract
Small Object Detection (SOD) poses significant challenges due to limited information and the model's low class prediction scores. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely unexplored. In typical DETR-like frameworks, the CNN backbone, which specializes in aggregating local information, struggles to capture the contextual information necessary for SOD. The multiple attention layers in the Transformer Encoder have difficulty attending effectively to small objects and can also blur their features. Furthermore, the model's lower class prediction scores for small objects compared to large objects further increase the difficulty of SOD. To address these challenges, we introduce a novel approach called Cross-DINO. It incorporates a deep MLP network to aggregate initial feature representations with both short- and long-range information for SOD. A new Cross Coding Twice Module (CCTM) then integrates these initial representations into the Transformer Encoder features, enhancing the details of small objects. Additionally, we introduce a new kind of soft label named Category-Size (CS), which integrates the category and size of objects. Treating CS as the new ground truth, we propose a new loss function called Boost Loss to improve the model's class prediction scores. Extensive experimental results on the COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D datasets demonstrate that Cross-DINO efficiently improves the performance of DETR-like models on SOD. Specifically, our model achieves 36.4% APs on COCO with only 45M parameters, outperforming DINO by +4.4% APs (36.4% vs. 32.0%) with fewer parameters and FLOPs under a 12-epoch training setting. The source code will be available at https://github.com/Med-Process/Cross-DINO.
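The Category-Size soft label and Boost Loss are only described at a high level above, so the following is a speculative Python sketch of one plausible formulation: the ground-truth class entry is down-weighted by the object's relative size, and a binary cross-entropy is computed against that softened target. The square-root scaling, the `floor` parameter, and the BCE form are illustrative assumptions, not the paper's exact definitions.

```python
import math

def category_size_soft_label(one_hot, obj_area, img_area, floor=0.5):
    """Build a hypothetical Category-Size (CS) soft label: the ground-truth
    class entry is scaled by a size factor in [floor, 1], so smaller objects
    receive softer targets. The sqrt scaling and `floor` are illustrative."""
    size_ratio = math.sqrt(obj_area / img_area)          # relative object scale in (0, 1]
    size_factor = floor + (1.0 - floor) * min(size_ratio, 1.0)
    return [p * size_factor for p in one_hot]

def boost_loss(pred, cs_label, eps=1e-7):
    """Binary cross-entropy against the CS soft targets: one plausible
    reading of "treating CS as new ground truth"; the published form
    may differ."""
    total = 0.0
    for p, t in zip(pred, cs_label):
        p = min(max(p, eps), 1.0 - eps)                  # clamp for numerical safety
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)
```

For a 32x32 object in a 640x640 image, the relative scale is 0.05, so the ground-truth entry is softened from 1.0 to 0.525 under these assumed parameters; a full-image object keeps a target of 1.0.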
Problem

Research questions and friction points this paper is trying to address.

Improves small object detection with the Cross-DINO approach
Enhances feature details via the Cross Coding Twice Module
Boosts class prediction scores with Category-Size soft labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines a deep MLP and a Transformer for feature aggregation
Introduces the Cross Coding Twice Module for detail enhancement
Proposes Category-Size soft labels with a Boost Loss
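As a hedged illustration of what "coding twice" might mean, here is a minimal pure-Python sketch in which each stream refines the other once, using simple additive gating as a stand-in for the module's actual cross-attention and projection layers. The gating, ordering, and return shape are assumptions, not the published design.

```python
def cctm(mlp_feat, enc_feat, gate=0.5):
    """Hypothetical Cross Coding Twice: two coding passes, one per direction.
    Pass 1 injects MLP detail into the Transformer Encoder features; pass 2
    feeds the refined encoder features back into the MLP stream. Additive
    gating is an illustrative stand-in for learned cross-attention."""
    enc_refined = [e + gate * m for e, m in zip(enc_feat, mlp_feat)]     # pass 1: MLP -> encoder
    mlp_refined = [m + gate * e for m, e in zip(mlp_feat, enc_refined)]  # pass 2: encoder -> MLP
    return enc_refined, mlp_refined
```

On toy 2-dimensional features, `cctm([1.0, 2.0], [3.0, 4.0])` returns `([3.5, 5.0], [2.75, 4.5])`: the encoder stream has absorbed MLP detail, and the MLP stream has been conditioned on the already-refined encoder features.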