🤖 AI Summary
To address insufficient multi-scale feature representation in real-time object detection, this paper proposes YOLO-MS—a lightweight, end-to-end trainable multi-scale collaborative feature enhancement framework. Methodologically, it introduces a multi-branch basic block coupled with a multi-kernel convolutional fusion structure to reformulate cross-scale feature learning; it trains from scratch on the detection data alone, without ImageNet pretraining; and it offers plug-and-play compatibility for seamless integration into mainstream YOLO architectures. Experimentally, YOLO-MS-XS achieves 42.1% AP on COCO, outperforming RTMDet by 2.0 points at the same model size. When deployed as a plug-in module, it elevates YOLOv8-N's APₛ, APₗ, and AP to 20.3%, 55.1%, and 40.6%, respectively—while reducing both parameter count and MACs. These results demonstrate YOLO-MS's effectiveness in enhancing multi-scale feature learning with minimal computational overhead.
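The core idea—splitting channels into branches and convolving each branch with a different kernel size before fusing—can be illustrated with a minimal NumPy sketch. This is an illustrative toy version under assumed simplifications (depthwise convolution, fixed averaging kernels, equal channel splits), not the paper's exact MS-Block; the function names `ms_block` and `depthwise_conv2d` are ours.

```python
import numpy as np

def depthwise_conv2d(x, k):
    """Same-padded depthwise conv. x: (C, H, W), k: (C, kh, kw)."""
    C, H, W = x.shape
    kh, kw = k.shape[1], k.shape[2]
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * k[c])
    return out

def ms_block(x, kernel_sizes=(3, 5, 7)):
    """Toy multi-branch block: each channel group gets its own kernel size,
    so small kernels capture fine (small-object) detail while large kernels
    widen the receptive field for large objects; outputs are concatenated."""
    C = x.shape[0]
    groups = np.array_split(np.arange(C), len(kernel_sizes))
    outs = []
    for idx, ks in zip(groups, kernel_sizes):
        # Placeholder averaging kernels stand in for learned weights.
        k = np.full((len(idx), ks, ks), 1.0 / (ks * ks))
        outs.append(depthwise_conv2d(x[idx], k))
    return np.concatenate(outs, axis=0)

x = np.ones((6, 8, 8))
y = ms_block(x)
print(y.shape)  # (6, 8, 8): spatial size preserved, branches re-fused
```

In the actual model, the per-branch convolutions are learned and a pointwise fusion follows the concatenation; the sketch only shows how heterogeneous kernel sizes coexist within one block.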
📝 Abstract
We aim to provide the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations into how multi-branch features of the basic block and convolutions with different kernel sizes affect the detection performance of objects at different scales. The outcome is a new strategy that can significantly enhance multi-scale feature representations of real-time object detectors. To verify the effectiveness of our work, we train our YOLO-MS on the MS COCO dataset from scratch, without relying on any other large-scale datasets, like ImageNet, or on pre-trained weights. Without bells and whistles, our YOLO-MS outperforms recent state-of-the-art real-time object detectors, including YOLOv7, RTMDet, and YOLOv8. Taking the XS version of YOLO-MS as an example, it achieves an AP score of 42+% on MS COCO, about 2% higher than RTMDet at the same model size. Furthermore, our work can also serve as a plug-and-play module for other YOLO models. Notably, our method advances the APₛ, APₗ, and AP of YOLOv8-N from 18%+, 52%+, and 37%+ to 20%+, 55%+, and 40%+, respectively, with even fewer parameters and MACs. Code and trained models are publicly available at https://github.com/FishAndWasabi/YOLO-MS. We also provide the Jittor version at https://github.com/NK-JittorCV/nk-yolo.