🤖 AI Summary
This work addresses the limitations of existing semantic segmentation methods for ultra-high-resolution images, which either lose global context due to sliding-window inference or sacrifice fine-grained details through aggressive downsampling. To overcome this, the authors propose a local–global dual-branch architecture that integrates learnable relay tokens into standard vision Transformers (e.g., ViT, Swin) to explicitly enable multi-scale feature interaction. The high-resolution branch preserves local details, while the low-resolution branch captures global semantics, with the two streams efficiently fused via a small set of relay tokens. This mechanism incurs less than a 2% increase in model parameters yet consistently improves performance across multiple benchmarks—including Archaeoscape, URUR, Gleason, and Cityscapes—achieving up to a 15% relative gain in mIoU.
📝 Abstract
Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks (Archaeoscape, URUR, and Gleason) and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.
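To make the aggregate-and-propagate idea concrete, here is a minimal NumPy sketch of one fusion step. It is an illustrative interpretation, not the paper's implementation: the function names (`relay_fuse`, `cross_attend`), the single-head attention, the residual connections, and all dimensions are assumptions; the actual method presumably uses learned projections inside the Transformer blocks.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values):
    # Single-head dot-product attention: each query row gathers a
    # weighted mix of the key/value rows. (Hypothetical simplification:
    # no learned Q/K/V projections.)
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values


def relay_fuse(local_tokens, global_tokens, relay_tokens):
    """One hypothetical aggregate-and-propagate step.

    local_tokens:  (N_l, d) patch tokens from the high-resolution crop branch
    global_tokens: (N_g, d) patch tokens from the low-resolution branch
    relay_tokens:  (R, d)   small set of learnable tokens, R << N_l, N_g
    """
    # 1) Aggregate: relay tokens read from both branches, so the cost of
    #    cross-branch mixing scales with R rather than N_l * N_g.
    both = np.concatenate([local_tokens, global_tokens], axis=0)
    relay = relay_tokens + cross_attend(relay_tokens, both)
    # 2) Propagate: each branch reads the fused context back from the
    #    small relay set, with a residual connection.
    local_out = local_tokens + cross_attend(local_tokens, relay)
    global_out = global_tokens + cross_attend(global_tokens, relay)
    return local_out, global_out, relay
```

Because all cross-branch interaction is routed through a handful of relay tokens, the extra compute and parameters stay small, which is consistent with the reported <2% parameter overhead.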