🤖 AI Summary
This work addresses the challenge that existing texture recognition methods struggle to simultaneously preserve spatial structure and explicitly model second-order interactions among local channels. To this end, we propose TwistNet-2D, a lightweight module featuring a novel spiral-twist channel interaction mechanism. By computing pairwise channel products under four-directional spiral displacements, our method explicitly encodes cross-position feature co-occurrence and interaction patterns. Combined with learnable channel reweighting and a sigmoid-gated residual connection, TwistNet-2D can be seamlessly embedded into backbone architectures such as ResNet. Despite adding only 3.5% more parameters and 2% more FLOPs, our approach outperforms larger models (including ConvNeXt and Swin Transformer) on four benchmark datasets for texture and fine-grained visual recognition.
📝 Abstract
Second-order feature statistics are central to texture recognition, yet current methods face a fundamental tension: bilinear pooling and Gram matrices capture global channel correlations but collapse spatial structure, while self-attention models spatial context through weighted aggregation rather than explicit pairwise feature interactions. We introduce TwistNet-2D, a lightweight module that computes local pairwise channel products under directional spatial displacement, jointly encoding where features co-occur and how they interact. The core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map along a prescribed direction before element-wise channel multiplication, thereby capturing the cross-position co-occurrence patterns characteristic of structured and periodic textures. Aggregating four directional heads with learned channel reweighting and injecting the result through a sigmoid-gated residual path, TwistNet-2D incurs only 3.5% additional parameters and 2% additional FLOPs over ResNet-18, yet consistently surpasses both parameter-matched and substantially larger baselines, including ConvNeXt, Swin Transformer, and hybrid CNN-Transformer architectures, across four texture and fine-grained recognition benchmarks.
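The STCI mechanism described above (directional shift, element-wise channel product, four-head aggregation, channel reweighting, sigmoid-gated residual) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the exact spiral displacement pattern is not specified in the abstract, so this sketch assumes four unit shifts (up, down, left, right) with zero padding, and the names `shift2d`, `stci_block`, `w_channel`, and `gate` are hypothetical.

```python
import numpy as np

def shift2d(x, dy, dx):
    """Shift a (C, H, W) feature map by (dy, dx) with zero padding.

    Positive dy moves content down; positive dx moves content right.
    """
    C, H, W = x.shape
    out = np.zeros_like(x)
    src_y = slice(max(-dy, 0), H - max(dy, 0))
    dst_y = slice(max(dy, 0), H - max(-dy, 0))
    src_x = slice(max(-dx, 0), W - max(dx, 0))
    dst_x = slice(max(dx, 0), W - max(-dx, 0))
    out[:, dst_y, dst_x] = x[:, src_y, src_x]
    return out

def stci_block(x, w_channel, gate):
    """Hypothetical sketch of Spiral-Twisted Channel Interaction.

    x         : (C, H, W) input feature map
    w_channel : (C,) learned channel reweighting vector (assumed form)
    gate      : scalar gate parameter, squashed by a sigmoid
    """
    # Four directional heads: multiply the map by a shifted copy of itself,
    # encoding cross-position, second-order feature co-occurrence.
    directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    heads = [x * shift2d(x, dy, dx) for dy, dx in directions]
    agg = sum(heads) / len(heads)              # aggregate the four heads
    agg = agg * w_channel[:, None, None]       # learned channel reweighting
    g = 1.0 / (1.0 + np.exp(-gate))            # sigmoid gate in (0, 1)
    return x + g * agg                         # gated residual injection
```

With `gate` driven strongly negative the sigmoid closes and the block reduces to an identity mapping, which is one plausible reason a gated residual lets the module be dropped into a pretrained backbone such as ResNet-18 without disrupting its features at initialization.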