TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

2026-05-24
Citations: 0
Influential: 0
AI Summary
This work addresses the limitations of existing YOLO- and DETR-based detectors in tiny-object detection: large-stride downsampling in YOLO-style backbones suppresses tiny instances, and coarse tokenization in DETR leaves tiny objects with only a few weak tokens that are easily overlooked during matching. To overcome these issues, we propose TinyFormer, a real-time detector that combines a Vision Transformer (ViT) backbone, a YOLO-style feature pyramid, and DETR-style set prediction without non-maximum suppression (NMS). We further introduce a Parallel Bi-fusion Module (PBM) to preserve high-resolution spatial details from shallow layers and a Spatial Semantic Adapter (SSA) to mitigate the spatial information loss induced by tokenization. On MS COCO, TinyFormer-X achieves 58.4% AP without PBM; adding PBM raises the overall AP to 58.5% and yields a 1.6% AP gain on small objects. After pre-training on Objects365, TinyFormer-X-PBM attains 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computational cost.
Abstract
YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO-DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.
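To make the stride argument in the abstract concrete, the following is a minimal arithmetic sketch of how little of a feature map or token grid a tiny object occupies. The 640-pixel input, the 16-pixel object, and the strides 4 through 32 are illustrative assumptions, not values taken from the paper, and the helper names `footprint` and `token_count` are hypothetical.

```python
def footprint(side_px: float, stride: int) -> float:
    """Area of a square object of side `side_px`, measured in feature-map
    cells (or tokens) when the map is downsampled by `stride`."""
    return (side_px / stride) ** 2


def token_count(image_side: int, stride: int) -> int:
    """Number of cells (or tokens) for a square image at the given stride."""
    return (image_side // stride) ** 2


if __name__ == "__main__":
    image_side = 640   # assumed input resolution (illustrative)
    object_side = 16   # assumed tiny-object size in pixels (illustrative)
    for stride in (4, 8, 16, 32):
        cells = footprint(object_side, stride)
        total = token_count(image_side, stride)
        print(f"stride {stride:>2}: object covers {cells:g} of {total} cells")
```

Under these assumptions the object spans 16 cells at stride 4 but only a quarter of one cell at stride 32, which is the kind of loss that high-resolution shortcuts from shallow stages are meant to counteract.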
Problem

Research questions and friction points this paper is trying to address.

tiny-object detection
YOLO
DETR
real-time detection
feature fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

TinyFormer
Parallel Bi-fusion Module
Spatial Semantic Adapter
YOLO-DETR hybrid
tiny-object detection
Similar Papers
2024-06-09 | IEEE Transactions on Geoscience and Remote Sensing | Citations: 18