OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing ConvNets predominantly enlarge convolutional kernels to expand receptive fields, overlooking the top-down "overview-first, then closely inspect" attention mechanism inherent in human vision—limiting performance gains. This paper proposes the Overview-first-Look-Closely-next (OverLoCK) paradigm: a purely convolutional backbone that, for the first time, enables dynamic contextual injection at both feature and kernel-weight levels. Key innovations include: (i) a Deep-stage Decomposition Strategy (DDS), decoupling global and local modeling; (ii) Context-Mixing Dynamic Convolution (ContMix), jointly encoding long-range dependencies and local inductive bias; and (iii) dynamic kernel-weight modulation with multi-scale semantic fusion. Experiments demonstrate that OverLoCK-T achieves 84.2% Top-1 accuracy on ImageNet-1K—surpassing ConvNeXt-B with only around one-third of the FLOPs/parameters. In downstream tasks, OverLoCK-S improves AP$^b$ by 1.0% over MogaNet-B with Cascade Mask R-CNN, and OverLoCK-T boosts mIoU by 1.7% over UniRepLKNet-T in UperNet-based semantic segmentation.

📝 Abstract
In the human vision system, top-down attention plays a crucial role in perception, wherein the brain initially performs an overall but rough scene analysis to extract salient cues (i.e., overview first), followed by a finer-grained examination to make more accurate judgments (i.e., look closely next). However, recent efforts in ConvNet designs primarily focused on increasing kernel size to obtain a larger receptive field without considering this crucial biomimetic mechanism to further improve performance. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both feature and kernel weight levels. To fully unleash the power of top-down context guidance, we further propose a novel Context-Mixing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases. These properties are absent in previous convolutions. With the support from both DDS and ContMix, our OverLoCK exhibits notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while only using around one-third of the FLOPs/parameters. On object detection with Cascade Mask R-CNN, our OverLoCK-S surpasses MogaNet-B by a significant 1% in AP$^b$. On semantic segmentation with UperNet, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.
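To make the ContMix idea above concrete, here is a minimal NumPy sketch of a context-conditioned dynamic depthwise convolution: per-position kernel weights are derived from the affinity between each local neighborhood and a pooled top-down context vector, so the mixing is input-dependent yet still local. The function name, the softmax-over-taps scheme, and the single-vector context are illustrative simplifications, not the paper's exact formulation (which operates at both feature and kernel-weight levels across decomposed stages).

```python
import numpy as np

def contmix_depthwise(x, context, k=3):
    """Simplified sketch of a context-mixing dynamic depthwise conv.

    x:       (C, H, W) feature map.
    context: (C,) globally pooled top-down context descriptor.
    For each spatial position, the k*k kernel weights are the softmax
    of the affinity between each neighborhood tap and the context
    vector, so long-range guidance modulates a local convolution.
    NOTE: a toy reference implementation, not the paper's ContMix.
    """
    C, H, W = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]            # (C, k, k)
            # affinity of each tap to the context -> dynamic kernel logits
            logits = np.einsum("c,ckl->kl", context, patch)
            w = np.exp(logits - logits.max())
            w /= w.sum()                               # softmax over k*k taps
            # context-weighted local aggregation, shared across channels
            out[:, i, j] = np.einsum("kl,ckl->c", w, patch)
    return out
```

On a constant input the affinities are uniform, so the output equals the input; on structured inputs, taps that align with the context vector dominate the mixing, which is the intended "overview guides the close look" behavior.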
Problem

Research questions and friction points this paper is trying to address.

Enhance ConvNet with biomimetic vision mechanisms
Integrate dynamic top-down context guidance
Improve performance in vision tasks efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Biomimetic Deep-stage Decomposition Strategy
Context-Mixing Dynamic Convolution
Top-down context guidance integration