AI Summary
This work addresses the limitations of existing YOLO- and DETR-based detectors in tiny-object detection: aggressive downsampling in YOLO-style backbones suppresses tiny instances in deep feature maps, while coarse tokenization in DETR leaves tiny objects with only a few weak tokens that are easily overlooked during matching. To overcome these issues, we propose TinyFormer, a real-time detector that combines a Vision Transformer (ViT) backbone, a YOLO-style feature pyramid, and DETR-style set prediction without non-maximum suppression (NMS). We further introduce a Parallel Bi-fusion Module (PBM) to preserve high-resolution spatial details from shallow layers and a Spatial Semantic Adapter (SSA) to mitigate the spatial information loss induced by tokenization. On MS COCO, TinyFormer-X achieves 58.4% AP without PBM; adding PBM raises the overall AP to 58.5% and yields a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM attains 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computational cost.
Abstract
YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO-DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.
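To make the stride argument concrete, the following minimal sketch (not taken from the TinyFormer code) shows how little of a coarse grid a tiny object covers. The 16-pixel object side, the list of strides, and the helper name `cell_coverage` are illustrative assumptions; the abstract does not state which strides or patch sizes TinyFormer uses.

```python
def cell_coverage(object_side_px: float, stride: int) -> float:
    """Area of a square object, measured in units of one grid cell.

    One grid cell (or one token) corresponds to a stride x stride pixel
    patch of the input image. Hypothetical helper for illustration only.
    """
    return (object_side_px / stride) ** 2


# Assumed example: a 16x16-pixel object at several common feature strides.
OBJECT_SIDE_PX = 16
for stride in (4, 8, 16, 32):
    coverage = cell_coverage(OBJECT_SIDE_PX, stride)
    print(f"stride {stride:>2}: object covers {coverage:g} cell(s)")
```

Under these assumptions the object spans 16 cells at stride 4 but only a quarter of a single cell at stride 32, which is the situation that high-resolution shortcuts from shallow stages are meant to counteract.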