Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens

📅 2026-01-09
🏛️ Trans. Mach. Learn. Res.
🤖 AI Summary
This work addresses two limitations of existing semantic segmentation methods for ultra-high-resolution images: sliding-window inference discards global context, while aggressive downsampling loses fine detail. The authors propose a local-global dual-branch design that adds learnable relay tokens to standard vision transformers (e.g., ViT, Swin) to enable explicit multi-scale reasoning. The local branch processes small high-resolution crops and preserves detail, the global branch processes large low-resolution crops and captures context, and a small set of relay tokens aggregates and propagates features between the two. The design adds fewer than 2% parameters and yields consistent gains on three ultra-high-resolution benchmarks (Archaeoscape, URUR, and Gleason) and on Cityscapes, with up to 15% relative mIoU improvement.
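
To make the relay-token idea concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the attention has no learned projections, the function names (`attention`, `block`, `relay_step`) are hypothetical, and the exchange order (relay tokens visit the global sequence first, then the local one) is an assumption chosen only to show how a few shared tokens could carry context between branches.

```python
import math

def attention(queries, keys_values):
    """Toy single-head dot-product attention (no learned projections)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, keys_values)) / total
            for d in range(dim)
        ])
    return outputs

def block(tokens):
    """Residual self-attention over one token sequence."""
    mixed = attention(tokens, tokens)
    return [[t + m for t, m in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]

def relay_step(local_tokens, global_tokens, relay_tokens):
    """One exchange: relay tokens gather global context, then share it locally.

    The ordering here is an assumption made for illustration.
    """
    n = len(relay_tokens)
    # Relay tokens are appended to the global sequence and attend with it.
    out = block(global_tokens + relay_tokens)
    global_tokens, relay_tokens = out[:-n], out[-n:]
    # The updated relay tokens are then appended to the local sequence.
    out = block(local_tokens + relay_tokens)
    local_tokens, relay_tokens = out[:-n], out[-n:]
    return local_tokens, global_tokens, relay_tokens

# Two local tokens, two global tokens, one relay token (all values made up).
local = [[1.0, 0.0], [0.0, 1.0]]
relay = [[0.1, 0.1]]
local_a, _, _ = relay_step(local, [[1.0, 1.0], [2.0, 0.0]], relay)
local_b, _, _ = relay_step(local, [[-1.0, -1.0], [0.0, -2.0]], relay)
```

In this sketch the local tokens never attend to the global tokens directly, yet `local_a` and `local_b` differ because the relay token carried different global context into the local sequence.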

📝 Abstract
Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks (Archaeoscape, URUR, and Gleason) and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.
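
The two-scale cropping described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: the crop size (512), scale factor (4), patch size (16), the choice to center the global crop on the local one, and the names `Window`, `paired_windows`, and `token_count` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Window:
    top: int
    left: int
    size: int  # side length in full-resolution pixels

def clamp(value, low, high):
    return max(low, min(value, high))

def paired_windows(image_h, image_w, center_y, center_x, local_size=512, scale=4):
    """Return (local, global) square windows around the same point.

    The global window covers `scale` times the side length of the local one
    and is shifted where needed so that it stays inside the image.
    """
    global_size = local_size * scale
    if global_size > min(image_h, image_w):
        raise ValueError("image is smaller than the global window")

    def centered(size):
        top = clamp(center_y - size // 2, 0, image_h - size)
        left = clamp(center_x - size // 2, 0, image_w - size)
        return Window(top, left, size)

    return centered(local_size), centered(global_size)

def token_count(window, downsample, patch=16):
    """Number of patch tokens after downsampling the crop by `downsample`."""
    side = window.size // downsample
    return (side // patch) ** 2

local, global_ = paired_windows(8192, 8192, center_y=4096, center_x=4096)
print(local, token_count(local, downsample=1))      # 1024 tokens
print(global_, token_count(global_, downsample=4))  # 1024 tokens
```

With these assumed settings, downsampling the global crop by the same factor that enlarges it gives both branches the same number of tokens, so the global branch sees 16 times the area at roughly the cost of one extra local crop.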
Problem

Research questions and friction points this paper is trying to address.

ultra-high resolution
semantic segmentation
global context
fine detail
vision transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relay Tokens
Vision Transformers
Ultra-High Resolution Semantic Segmentation
Multi-scale Reasoning
Global-Local Feature Aggregation