🤖 AI Summary
Vision backbones rely heavily on Transformers, while conventional LSTMs suffer from poor parallelizability and weak long-range modeling. Method: We propose Vision-LSTM (ViL), the first general-purpose vision backbone built on the scalable xLSTM, which features exponential gating and a parallelizable matrix memory, adapted to model sequences of image patches. ViL introduces a bidirectional alternating processing scheme to jointly capture local details and global dependencies, combined with lightweight patch embedding and sequential modeling. Contribution/Results: ViL attains strong representational capacity at low computational overhead. Experiments show that ViL consistently outperforms LSTM-based baselines on ImageNet classification, COCO object detection, and ADE20K semantic segmentation, while approaching the performance of state-of-the-art Vision Transformers (ViTs). This validates ViL as an efficient, general-purpose visual backbone with strong practical potential.
📝 Abstract
Transformers are widely used as generic backbones in computer vision, despite being initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise for further deployment as a new generic backbone for computer vision architectures.
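The alternating scan described above (odd blocks top-to-bottom, even blocks bottom-to-top) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `make_block` stand-in uses a causal prefix-sum purely so that the scan direction is observable, where a real ViL block would apply an xLSTM recurrence with exponential gating and matrix memory.

```python
import numpy as np

def make_block():
    # Hypothetical stand-in for an xLSTM block: a causal prefix-sum over the
    # token axis, so each token only sees tokens earlier in the scan order.
    return lambda x: np.cumsum(x, axis=0)

def vil_forward(tokens, num_blocks=2):
    """Alternating bidirectional processing as described in the abstract:
    odd blocks (1st, 3rd, ...) scan patch tokens top-to-bottom, even blocks
    (2nd, 4th, ...) scan bottom-to-top by flipping, processing, and
    flipping back, so token order is preserved between blocks."""
    x = np.asarray(tokens, dtype=float)
    for i in range(num_blocks):
        block = make_block()
        if i % 2 == 0:                    # odd block (1-indexed): forward scan
            x = block(x)
        else:                             # even block: reversed scan
            x = block(x[::-1])[::-1]
    return x
```

With a single forward block, `vil_forward([[1.], [2.], [3.]], num_blocks=1)` yields the running sums `[[1.], [3.], [6.]]`; adding a second, reversed block then accumulates from the bottom up, so information flows in both directions across the stack.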