iiANET: Inception Inspired Attention Hybrid Network for Efficient Long-Range Dependency

📅 2024-07-10
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the inefficiency of modeling long-range dependencies in complex images, this paper proposes iiANET, an efficient hybrid network. Its core building block, iiABlock, arranges several pathways in parallel in an Inception-inspired layout: 2D multi-head self-attention (2D-MHSA) augmented with register tokens for global context, MBConv2 for local-detail extraction, and dilated convolution for an enlarged receptive field, followed by an ECANET module that calibrates channel-wise attention. This design balances global contextual modeling with fine-grained local feature capture, making long-range dependency modeling markedly more efficient. Evaluated on multiple benchmark datasets, iiANET outperforms several state-of-the-art models at significantly lower computational cost, demonstrating its effectiveness, efficiency, and generalization ability.

📝 Abstract
The recent emergence of hybrid models has introduced another transformative approach to solving computer vision tasks, slowly shifting away from conventional CNNs (Convolutional Neural Networks) and ViTs (Vision Transformers). However, little effort has been made to combine these two approaches efficiently so as to better capture the long-range dependencies prevalent in complex images. In this paper, we introduce iiANET (Inception Inspired Attention Network), an efficient hybrid model designed to capture long-range dependencies in complex images. The fundamental building block, iiABlock, integrates global 2D-MHSA (Multi-Head Self-Attention) with Registers, MBConv2 (MobileNetV2-based convolution), and dilated convolution in parallel, enabling the model to leverage self-attention for capturing long-range dependencies while utilizing MBConv2 for effective local-detail extraction and dilated convolution for efficiently expanding the kernel receptive field to capture more contextual information. Lastly, we serially integrate an ECANET (Efficient Channel Attention Network) at the end of each iiABlock to calibrate channel-wise attention for enhanced model performance. Extensive qualitative and quantitative evaluation on various benchmarks demonstrates improved performance over several state-of-the-art models.
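The parallel-branch design described in the abstract can be sketched schematically. The following is a minimal NumPy sketch, not the authors' implementation: it uses a single-head attention in place of full 2D-MHSA with registers, 1x1 channel-mixing matrices as stand-ins for the MBConv2 and dilated-convolution branches, and a fixed smoothing kernel for ECA's 1D channel convolution. All weight shapes and the sum-based fusion are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_2d(x, Wq, Wk, Wv):
    # Global branch: flatten the (H, W) grid into tokens and apply
    # single-head self-attention (a simplification of 2D-MHSA + Registers).
    C, H, W = x.shape
    t = x.reshape(C, H * W).T                  # (HW, C) tokens
    q, k, v = t @ Wq, t @ Wk, t @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (attn @ v).T.reshape(C, H, W)

def pointwise(x, M):
    # 1x1 channel mixing: stand-in for the MBConv2 / dilated-conv branches.
    C, H, W = x.shape
    return (M @ x.reshape(C, -1)).reshape(M.shape[0], H, W)

def eca(x, kernel=np.array([0.25, 0.5, 0.25])):
    # Efficient Channel Attention: global average pool per channel,
    # 1D convolution across neighboring channels, sigmoid gate.
    C, _, _ = x.shape
    d = x.mean(axis=(1, 2))                    # (C,) channel descriptors
    pad = len(kernel) // 2
    dp = np.pad(d, pad, mode="edge")
    g = np.array([dp[i:i + len(kernel)] @ kernel for i in range(C)])
    gate = 1.0 / (1.0 + np.exp(-g))            # sigmoid in (0, 1)
    return x * gate[:, None, None]

def iiablock(x, Wq, Wk, Wv, Wl, Wd):
    # Parallel branches (global attention, local conv, dilated conv
    # stand-ins) fused by summation, then serial ECA calibration.
    y = mhsa_2d(x, Wq, Wk, Wv) + pointwise(x, Wl) + pointwise(x, Wd)
    return eca(y)
```

Because every branch preserves the spatial grid, the outputs can be fused elementwise before ECA rescales each channel, which is the essence of the "parallel global-local, serial channel calibration" layout the abstract describes.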
Problem

Research questions and friction points this paper is trying to address.

Efficiently combining CNNs and vision transformers for long-range dependencies
Improving global and local feature extraction in complex images
Achieving strong feature interaction with computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inception inspired hybrid attention network
Parallel global-local feature fusion
Efficient register-augmented MHSA and convolution integration