🤖 AI Summary
Existing lightweight vision networks rely solely on either self-attention or standard convolutions for token mixing, and they struggle to balance accuracy and efficiency under constrained computational budgets. Inspired by the multiscale, dynamic nature of human vision, we propose a "see broadly, focus finely" design paradigm and introduce the novel LS convolution, which decouples and jointly optimizes large-kernel wide-field perception and small-kernel precise feature aggregation. We further combine multi-scale feature decomposition with hardware-efficient operators to construct an effective CNN architecture. The resulting model achieves state-of-the-art performance on ImageNet, COCO, and ADE20K, surpassing MobileNetV3, EfficientNet-Lite, and TinyViT with an average +1.8% Top-1 accuracy gain and a +12% inference speedup at comparable FLOPs. Notably, this work is the first to explicitly decouple and jointly optimize perceptual scale and aggregation scale within lightweight models.
📝 Abstract
Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (**L**arge-**S**mall) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Codes and models are available at https://github.com/jameslahm/lsnet.
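To make the "See Large, Focus Small" idea concrete, here is a minimal NumPy sketch of a two-branch depthwise operation: a large kernel perceives wide context, which then gates a small-kernel aggregation of local features. The kernel sizes (7 and 3), the random weights, and the sigmoid gating are illustrative assumptions for this sketch, not the paper's exact LS convolution; see the official repository for the real implementation.

```python
import numpy as np

def depthwise_conv2d(x, kernel):
    """Per-channel ('depthwise') 2-D convolution with zero 'same' padding.

    x: (C, H, W) feature map; kernel: (C, k, k), one filter per channel.
    """
    C, H, W = x.shape
    k = kernel.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]
            out[:, i, j] = (patch * kernel).sum(axis=(1, 2))
    return out

def ls_conv(x, large_k=7, small_k=3, seed=0):
    """Sketch of 'See Large, Focus Small': large-kernel perception
    gates small-kernel aggregation. Kernel sizes, random weights, and
    sigmoid gating are assumptions, not the paper's exact design."""
    rng = np.random.default_rng(seed)
    C = x.shape[0]
    k_large = rng.standard_normal((C, large_k, large_k)) / large_k**2
    k_small = rng.standard_normal((C, small_k, small_k)) / small_k**2
    context = depthwise_conv2d(x, k_large)      # wide-field perception
    gate = 1.0 / (1.0 + np.exp(-context))       # squash context into (0, 1)
    return gate * depthwise_conv2d(x, k_small)  # gated local aggregation

x = np.random.default_rng(1).standard_normal((8, 16, 16))
y = ls_conv(x)
print(y.shape)  # (8, 16, 16): spatial resolution and channels preserved
```

The key design point the sketch illustrates is the decoupling: the large-kernel branch only has to be cheap and wide (depthwise, so cost grows with kernel area but not channel pairs), while the small-kernel branch does the precise aggregation on a tight neighborhood.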