Generalized Referring Expression Segmentation on Aerial Photos

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Referring expression segmentation on aerial images faces challenges including large variations in spatial resolution, inconsistent color characteristics, small and densely packed objects, and partial occlusions. To address these, we propose Aerial-D, the first large-scale, cross-era aerial referring expression segmentation dataset, comprising over 1.5 million diverse language expressions generated via a hybrid pipeline that integrates rule-based systems with large language models. We further introduce an imaging degradation simulation filter that adapts modern models to historical imagery (e.g., grayscale, sepia-toned, or grainy photos). Our method extends the RSRefSeg architecture to enable unified segmentation modeling for both contemporary and historical aerial images. Experiments demonstrate state-of-the-art performance on modern benchmarks and strong robustness across multiple synthetic degradation conditions. Together, the dataset and framework are the first to enable precise, text-driven, multi-temporal aerial image segmentation.

📝 Abstract
Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges: spatial resolution varies widely across datasets, the use of color is inconsistent, targets often shrink to only a few pixels, and scenes contain very high object densities and partially occluded objects. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning individual object instances, groups of instances, and semantic regions across 21 distinct classes that range from vehicles and infrastructure to land coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual detail within the referring expressions. Filters were additionally applied to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks while maintaining strong accuracy under the monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial-d.
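The historic-imaging filters described above (monochrome, sepia, grain) can be approximated with simple image-array operations. The paper's exact filter parameters are not given here, so the following is a minimal illustrative sketch: the mode names, luminance weights, sepia tint vector, and noise level are assumptions, not the authors' implementation.

```python
import numpy as np

def degrade(img, mode="sepia", grain_std=0.05, seed=0):
    """Simulate archival imaging conditions on an RGB image.

    img:       float array in [0, 1], shape (H, W, 3).
    mode:      "grayscale", "sepia", or "none" (grain only).
    grain_std: std. dev. of additive Gaussian film grain.
    Illustrative parameters; not the paper's actual filter.
    """
    out = np.asarray(img, dtype=np.float64).copy()
    # Luminance via ITU-R BT.601 weights for monochrome conversion.
    lum = out @ np.array([0.299, 0.587, 0.114])
    if mode == "grayscale":
        out = np.repeat(lum[..., None], 3, axis=-1)
    elif mode == "sepia":
        # Tint the luminance toward warm brown tones (assumed tint vector).
        out = lum[..., None] * np.array([1.0, 0.85, 0.6])
    # Additive Gaussian noise approximates film grain.
    rng = np.random.default_rng(seed)
    out = out + rng.normal(0.0, grain_std, out.shape)
    return np.clip(out, 0.0, 1.0)
```

Applying such filters to modern training scenes, as the pipeline does, lets a single model see both clean and degraded versions of each image, which is what supports the reported robustness under archival conditions.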
Problem

Research questions and friction points this paper is trying to address.

Develops a dataset for text-guided segmentation in aerial imagery
Addresses challenges like variable resolution and high object density
Enables unified segmentation for both modern and historical aerial photos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset with automatic LLM-enhanced expression generation
Unified model for instance and semantic segmentation from text
Robust performance across modern and degraded historical aerial imagery