Atlas: Multi-Scale Attention Improves Long Context Image Modeling

📅 2025-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of balancing computational complexity and modeling accuracy in high-resolution, long-context image modeling, this paper introduces a Multi-Scale Attention (MSA) mechanism and the Atlas network architecture. Methodologically, it (1) constructs an *O*(log *N*)-depth hierarchical feature pyramid for compact multi-scale representation, and (2) designs bidirectional cross-scale attention to enable efficient long-range information propagation. On ImageNet-100 at 1024×1024 resolution, Atlas-B achieves 91.04% top-1 accuracy—comparable to ConvNeXt-B (91.92%) while running 4.3× faster—and is 2.95× faster and 7.38% more accurate than FasterViT. At 4096×4096, Atlas-S outperforms MambaVision-S by 32% in accuracy at similar runtime. This work integrates hierarchical multi-scale modeling and bidirectional cross-scale attention into the Transformer framework, achieving significant improvements in the efficiency–accuracy trade-off without sacrificing theoretical scalability.
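The *O*(log *N*)-depth pyramid in point (1) can be illustrated with a minimal sketch: repeatedly average-pool a square grid of patch tokens by 2×2 until a single token remains, yielding ~log₂ of the grid side in levels. This is a hedged illustration of the general idea, not the paper's actual implementation (Atlas's pooling and feature dimensions may differ).

```python
import numpy as np

def build_pyramid(tokens, grid):
    """Build an O(log N)-depth feature pyramid by repeated 2x2 average pooling.

    tokens: (grid*grid, d) array of patch features; grid is assumed a power of 2.
    Returns a list of progressively coarser token sets, ~log2(grid) + 1 levels deep.
    """
    d = tokens.shape[1]
    levels = [tokens]
    x = tokens.reshape(grid, grid, d)
    while x.shape[0] > 1:
        g = x.shape[0] // 2
        # average each non-overlapping 2x2 block of tokens into one coarser token
        x = x.reshape(g, 2, g, 2, d).mean(axis=(1, 3))
        levels.append(x.reshape(g * g, d))
    return levels

levels = build_pyramid(np.random.randn(64 * 64, 32), grid=64)
print([lvl.shape[0] for lvl in levels])  # [4096, 1024, 256, 64, 16, 4, 1]
```

Note that the total token count across all levels is only ~4/3 of the finest level, which is what keeps the multi-scale representation compact.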

📝 Abstract
Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling in a high-resolution variant of ImageNet-100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNeXt-B (91.92%) while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, and 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find Atlas-S achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.
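The abstract's second ingredient, bi-directional cross-scale communication, can be sketched as cross-attention run in both directions between a fine scale and a coarse scale. The sketch below is a minimal single-head version with learned projections, normalization, and the paper's specific block layout omitted; it illustrates the information flow, not Atlas's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(queries, context):
    """Single-head cross-attention: each query token attends over all context tokens.

    Learned W_q/W_k/W_v projections are omitted for brevity; a real block
    would also add layer norm and an MLP.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (n_q, n_ctx) similarity
    return softmax(scores) @ context            # (n_q, d) aggregated context

fine = np.random.randn(256, 32)     # e.g. a 16x16 grid of patch features
coarse = np.random.randn(16, 32)    # e.g. the pooled 4x4 scale above it

# bi-directional communication with residual connections:
fine_out = fine + cross_attend(fine, coarse)      # coarse -> fine (top-down context)
coarse_out = coarse + cross_attend(coarse, fine)  # fine -> coarse (bottom-up detail)
```

Because every fine token only attends to the much smaller coarse set (and vice versa), long-range information crosses the image in a few hops through the pyramid instead of requiring full quadratic attention over all patches.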
Problem

Research questions and friction points this paper is trying to address.

Efficiently modeling massive images using multi-scale attention
Improving compute-performance tradeoff in long-context image modeling
Enhancing accuracy and speed in high-resolution image processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Scale Attention for image modeling
Bi-directional cross-scale communication
Atlas architecture improves compute-performance tradeoff