🤖 AI Summary
Existing public remote sensing benchmarks for agricultural parcel extraction mainly cover regular, relatively flat farmland and do not address terraced parcels in mountainous regions, where stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity make delineation harder. To address this gap, this work introduces GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, it integrates high-resolution optical imagery, structured text descriptions, and digital elevation model (DEM) data in an aligned image-text-DEM setting, and supports evaluation under Image-only, Image+Text, and Image+Text+DEM configurations. The study also proposes ETTerra (Elevation and Text guided Terraced parcel network), a multimodal baseline for terraced parcel delineation. Experiments show that textual semantics and terrain geometry provide cues complementary to visual appearance, yielding more accurate, coherent, and structurally consistent delineation than image-only input.
📝 Abstract
Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.