🤖 AI Summary
This work addresses object detection in remote sensing imagery, where targets vary widely in aspect ratio and scale, so existing backbones struggle to model elongated and regular-shaped objects with a single design. The authors propose PKINet-v2, an extension of PKINet that combines anisotropic axial-strip convolutions with isotropic square convolutions within one backbone. The resulting multi-scope receptive field preserves fine local detail while progressively aggregating long-range context across scales. The paper also introduces a Heterogeneous Kernel Re-parameterization (HKR) strategy that fuses all heterogeneous branches into a single depth-wise convolution at inference time, removing fragmented kernel launches without accuracy loss. The method reports state-of-the-art accuracy on four remote sensing benchmarks (DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R) and a 3.9× FPS speedup over PKINet-v1.
📝 Abstract
Object detection in remote sensing images (RSIs) is challenged by the coexistence of geometric and spatial complexity: targets may appear with diverse aspect ratios while spanning a wide range of object sizes under varied contexts. Existing RSI backbones address the two challenges separately, either by adopting anisotropic strip kernels to model slender targets or by using isotropic large kernels to capture broader context. However, such isolated treatments lead to complementary drawbacks: the strip-only design can disrupt spatial coherence for regular-shaped objects and weaken tiny details, whereas isotropic large kernels often introduce severe background noise and geometric mismatch for slender structures. In this paper, we extend PKINet and present a powerful and efficient backbone, named Poly Kernel Inception Network v2 (PKINet-v2), that jointly handles both challenges within a unified paradigm. PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels and builds a multi-scope receptive field, preserving fine-grained local textures while progressively aggregating long-range context across scales. To enable efficient deployment, we further introduce a Heterogeneous Kernel Re-parameterization (HKR) strategy that fuses all heterogeneous branches into a single depth-wise convolution for inference, eliminating fragmented kernel launches without accuracy loss. Extensive experiments on four widely used benchmarks, namely DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R, demonstrate that PKINet-v2 achieves state-of-the-art accuracy while delivering a $\textbf{3.9}\times$ FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.
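The abstract does not spell out how HKR is implemented, so the following is only a minimal sketch of the general principle that makes such a fusion possible: convolution is linear, so parallel branches whose outputs are summed can be replaced by one convolution whose kernel is the sum of the zero-padded branch kernels. The sketch assumes a single channel (one slice of a depth-wise convolution), stride 1, zero "same" padding, odd kernel sizes, and no normalization or nonlinearity between the branches and the sum. The function names `pad_kernel`, `fuse_kernels`, and `conv2d_same`, as well as the kernel values, are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def pad_kernel(kernel, size):
    """Zero-pad an odd-sized 2D kernel to size x size, keeping it centered."""
    kh, kw = len(kernel), len(kernel[0])
    top, left = (size - kh) // 2, (size - kw) // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(kh):
        for j in range(kw):
            out[top + i][left + j] = kernel[i][j]
    return out


def fuse_kernels(kernels):
    """Sum centered, zero-padded kernels into one square kernel."""
    size = max(max(len(k), len(k[0])) for k in kernels)
    fused = [[0.0] * size for _ in range(size)]
    for k in kernels:
        padded = pad_kernel(k, size)
        for i in range(size):
            for j in range(size):
                fused[i][j] += padded[i][j]
    return fused


def conv2d_same(image, kernel):
    """Single-channel cross-correlation, stride 1, zero 'same' padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


# Hypothetical heterogeneous branches: one square and two axial-strip kernels.
square = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]  # 3x3
h_strip = [[1.0, 2.0, 3.0, 2.0, 1.0]]                            # 1x5
v_strip = [[1.0], [-1.0], [2.0], [-1.0], [1.0]]                  # 5x1
branches = [square, h_strip, v_strip]

# Small integer-valued test image so the float arithmetic is exact.
image = [[float((3 * r + 5 * c) % 7) for c in range(6)] for r in range(6)]

# Training-time view: run every branch separately and sum the outputs.
branch_outputs = [conv2d_same(image, k) for k in branches]
branch_sum = [
    [sum(out[y][x] for out in branch_outputs) for x in range(6)]
    for y in range(6)
]

# Inference-time view: one convolution with the fused 5x5 kernel.
fused = fuse_kernels(branches)
fused_out = conv2d_same(image, fused)
```

Here `fused_out` matches `branch_sum` element for element, while the fused path needs one convolution instead of three. In a real backbone the same idea would be applied per channel to the weights of a depth-wise convolution, and any per-branch normalization would have to be folded into the kernels and biases first; whether PKINet-v2 does this in exactly this way is not stated in the abstract.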