🤖 AI Summary
This work addresses label noise and geographic distribution shift in plant species distribution modeling, both of which arise from sparse and biased observational data. The authors propose a multimodal fusion framework that integrates noisy presence-only (PO) records with scarce but high-quality presence-absence (PA) data. A geospatial alignment strategy aggregates PO records into pseudo-labels according to the geographic coverage of each satellite image, so that the PO labels are consistent with the remote sensing feature space. The framework also introduces a stackable, sequential tri-modal cross-attention mechanism that fuses image features from a Swin Transformer, tabular features extracted by TabM, and temporal dynamics modeled by a Temporal Swin Transformer. Combined with spatial-proximity-based partitioning of test samples and a mixture-of-experts inference design, the approach mitigates the effects of distribution shift and label noise and improves performance on the GeoLifeCLEF 2025 benchmark, particularly where PA data are extremely sparse and geographic bias is severe.
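The pseudo-label aggregation idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each satellite image covers one cell of a regular latitude/longitude grid, and the function name `aggregate_po_pseudo_labels`, the `tile_size_deg` parameter, and the sample records are all hypothetical. Every PO record falling inside a cell contributes its species to that cell's multi-label target.

```python
import math
from collections import defaultdict


def aggregate_po_pseudo_labels(observations, tile_size_deg=0.01):
    """Group PO records by grid cell and collect the species seen in each cell.

    observations: iterable of (lat, lon, species_id) tuples.
    tile_size_deg: assumed footprint of one image tile, in degrees (hypothetical).
    Returns {(row, col): sorted list of species ids}.
    """
    cells = defaultdict(set)
    for lat, lon, species_id in observations:
        # Index of the grid cell (the stand-in for an image footprint) containing the record.
        key = (math.floor(lat / tile_size_deg), math.floor(lon / tile_size_deg))
        cells[key].add(species_id)
    return {key: sorted(species) for key, species in cells.items()}


# Hypothetical records: two fall in the same cell, one falls elsewhere.
observations = [
    (48.001, 2.001, 17),
    (48.004, 2.006, 42),
    (48.503, 2.507, 17),
]
labels = aggregate_po_pseudo_labels(observations)
```

In this sketch the first two records share a cell, so that cell receives a two-species pseudo-label, while the third record forms its own single-species label.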
📝 Abstract
Large-scale, cross-species plant distribution prediction plays a crucial role in biodiversity conservation, yet modeling remains difficult because observational data are sparse and biased. Presence-Absence (PA) data provide accurate, noise-free labels but are costly to obtain and limited in quantity. Presence-Only (PO) data, by contrast, offer broad spatial and temporal coverage but suffer from severe label noise in negative samples, since a species that was not recorded at a site is not necessarily absent from it. To address these real-world constraints, this paper proposes a multimodal fusion framework that leverages the strengths of both PA and PO data. We introduce a pseudo-label aggregation strategy for PO data based on the geographic coverage of satellite imagery, which aligns the label space geographically with the remote sensing feature space. For the model architecture, we adopt Swin Transformer Base as the backbone for satellite imagery, use the TabM network for tabular feature extraction, retain the Temporal Swin Transformer for time-series modeling, and employ a stackable, serial tri-modal cross-attention mechanism to fuse the heterogeneous modalities. Furthermore, empirical analysis reveals a significant geographic distribution shift between PA training and test samples, and models trained by directly mixing PO and PA data tend to degrade because of label noise in the PO data. To address both problems, we draw on the mixture-of-experts paradigm: test samples are partitioned according to their spatial proximity to PA samples, and models trained on different datasets are used for inference and post-processing within each partition. Experiments on the GeoLifeCLEF 2025 dataset demonstrate that our approach achieves superior predictive performance in scenarios with limited PA coverage and pronounced distribution shifts.
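The proximity-based partitioning described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the 10 km threshold, the use of exactly two experts, the haversine distance, and the names `route_by_proximity` and `mixture_predict` are all hypothetical. Test sites close to a PA training site take predictions from one model, and the remaining sites take predictions from another.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km for broadcastable arrays of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against rounding slightly outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def route_by_proximity(test_coords, pa_coords, threshold_km=10.0):
    """Boolean mask: True where a test site is within threshold_km of any PA site."""
    test = np.asarray(test_coords, dtype=float)
    pa = np.asarray(pa_coords, dtype=float)
    # Pairwise distance matrix of shape (n_test, n_pa) via broadcasting.
    dist = haversine_km(test[:, None, 0], test[:, None, 1],
                        pa[None, :, 0], pa[None, :, 1])
    return dist.min(axis=1) <= threshold_km


def mixture_predict(test_coords, pa_coords, near_probs, far_probs, threshold_km=10.0):
    """Select per-site predictions from the 'near' or 'far' expert."""
    near = route_by_proximity(test_coords, pa_coords, threshold_km)
    return np.where(near[:, None], np.asarray(near_probs), np.asarray(far_probs))


# Hypothetical data: one PA site, one nearby test site, one distant test site.
pa_coords = [(48.0, 2.0)]
test_coords = [(48.01, 2.0), (50.0, 2.0)]
near_probs = np.array([[0.9, 0.1], [0.9, 0.1]])  # e.g. a model trained on PA data
far_probs = np.array([[0.2, 0.8], [0.2, 0.8]])   # e.g. a model trained with PO data
mask = route_by_proximity(test_coords, pa_coords)
probs = mixture_predict(test_coords, pa_coords, near_probs, far_probs)
```

The full pairwise distance matrix keeps the sketch short; for large test sets a spatial index such as a KD-tree or ball tree would likely be a more practical way to find the nearest PA site.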