🤖 AI Summary
This work addresses the challenge of efficiently capturing fine-grained features of small objects in ultra-high-resolution remote sensing imagery, where the extremely high spatial resolution leads to an explosion in the number of visual tokens. To tackle this issue, the authors propose a query-guided, region-faithful token compression framework that enables accurate and efficient feature extraction under strict computational budgets. The method integrates text-guided multi-scale importance scoring, a region-aware retain-and-merge strategy, and joint vision-language modeling to preserve critical details while controlling computational overhead. Experimental results demonstrate that the proposed approach significantly reduces computational costs across multiple remote sensing benchmarks while maintaining or even improving recognition performance on tiny objects.
📝 Abstract
Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. At this spatial scale, the number of visual tokens grows quadratically with image side length, which hinders the extraction of information from small objects. Prior work relies on direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework that efficiently selects visual tokens under a strict context budget. Specifically, we estimate the importance of visual tokens with text-guided, multi-scale scoring, enabling precise yet low-cost feature extraction. Furthermore, we introduce region-wise preserve and merge strategies that mitigate visual token redundancy and further reduce computational cost. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.
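To make the idea of budgeted, query-guided compression concrete, the following is a minimal illustrative sketch and not the UHR-BAT implementation. It assumes single-scale cosine similarity between each visual token and a text query embedding as the importance score (the abstract describes multi-scale estimation, which is omitted here), and it assumes one merged token per region as the "merge" step. The function name `compress_tokens`, its parameters, and the region assignment are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compress_tokens(tokens, regions, query, budget):
    """Hypothetical sketch: keep query-relevant tokens, merge the rest per region.

    tokens  -- list of feature vectors, one per visual token
    regions -- region id for each token (assumed given, e.g. from a grid)
    query   -- text query embedding in the same space as the tokens (assumption)
    budget  -- hard upper bound on the number of output tokens
    """
    region_ids = sorted(set(regions))
    if budget < len(region_ids):
        raise ValueError("budget must allow at least one token per region")

    # Importance score: similarity to the text query (single scale only).
    scores = [cosine(t, query) for t in tokens]

    # Reserve one merged slot per region; spend the rest on top-scoring tokens.
    keep_n = budget - len(region_ids)
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:keep_n])

    # Retained tokens pass through unchanged, in their original order.
    out = [tokens[i] for i in sorted(kept)]

    # Remaining tokens of each region are averaged into a single summary token,
    # so every region stays represented in the compressed sequence.
    for r in region_ids:
        rest = [tokens[i] for i in range(len(tokens))
                if regions[i] == r and i not in kept]
        if rest:
            dim = len(rest[0])
            out.append([sum(v[d] for v in rest) / len(rest) for d in range(dim)])
    return out


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
    compressed = compress_tokens(toks, regions=[0, 0, 1, 1],
                                 query=[1.0, 0.0], budget=3)
    print(len(compressed), compressed)
```

Because one slot per region is reserved before any token is retained, the output length never exceeds `budget`, which is the property that distinguishes a strict context budget from global top-k pruning with unpredictable cost. Mean pooling is only one possible merge rule and is used here for simplicity.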