Mapping on a Budget: Optimizing Spatial Data Collection for ML

📅 2025-09-03
🤖 AI Summary
Machine learning with satellite imagery (SatML) is constrained by labeled training data that are sparse, spatially clustered, and often collected for other purposes. Prior work has largely focused on model architectures and training algorithms rather than on modeling and optimizing the data acquisition process itself. This paper formalizes spatial training data collection as an optimization problem with heterogeneous data collection costs and realistic budget constraints, and proposes methods for solving it that are designed to generalize across SatML application domains. In simulation experiments spanning three continents and four tasks, the optimized sampling strategies yield substantial gains, and further experiments identify the settings in which optimized sampling is particularly effective. The paper gives special emphasis to augmenting clustered agricultural surveys for SatML monitoring in Togo.

📝 Abstract
In applications across agriculture, ecology, and human development, machine learning with satellite imagery (SatML) is limited by the sparsity of labeled training data. While satellite data cover the globe, labeled training datasets for SatML are often small, spatially clustered, and collected for other purposes (e.g., administrative surveys or field measurements). Despite the pervasiveness of this issue in practice, past SatML research has largely focused on new model architectures and training algorithms to handle scarce training data, rather than modeling data conditions directly. This leaves scientists and policymakers who wish to use SatML for large-scale monitoring uncertain about whether and how to collect additional data to maximize performance. Here, we present the first problem formulation for the optimization of spatial training data in the presence of heterogeneous data collection costs and realistic budget constraints, as well as novel methods for addressing this problem. In experiments simulating different problem settings across three continents and four tasks, our strategies reveal substantial gains from sample optimization. Further experiments delineate settings for which optimized sampling is particularly effective. The problem formulation and methods we introduce are designed to generalize across application domains for SatML; we put special emphasis on a specific problem setting where our coauthors can immediately use our findings to augment clustered agricultural surveys for SatML monitoring in Togo.

Problem

Research questions and friction points this paper is trying to address.

Optimizing spatial training data collection under budget constraints
Addressing sparse and clustered labeled data for satellite ML
Minimizing data collection costs while maximizing model performance
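
The points above can be condensed into a single constrained objective. The following is only a sketch under assumed notation: the symbols U, S, c_i, B, and f are illustrative and are not taken from the paper.

```latex
\[
\max_{S \subseteq U} \; f(S)
\quad \text{subject to} \quad
\sum_{i \in S} c_i \le B
\]
```

Here U is assumed to be the pool of candidate locations that could be labeled, c_i > 0 the cost of collecting a label at location i, B the total budget, and f(S) the performance of a model trained on the existing labels plus the newly collected set S. Heterogeneous costs enter through the c_i, which may differ from one location to another.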

Innovation

Methods, ideas, or system contributions that make the work stand out.

First problem formulation for optimizing spatial training data collection
Explicit modeling of heterogeneous data collection costs
Sample optimization under realistic budget constraints
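
To make the interaction between heterogeneous costs and a budget concrete, here is a minimal Python sketch of a greedy, cost-aware selection heuristic. It is not the method proposed in the paper. The function name `greedy_budgeted_sample`, the coverage proxy (distance to the nearest already-labeled point), and the toy coordinates and costs are all assumptions made for illustration.

```python
import math


def greedy_budgeted_sample(candidates, costs, labeled, budget):
    """Greedily pick candidate indices by coverage gain per unit cost.

    Hypothetical illustration only. The "gain" of a candidate is assumed to be
    its distance to the nearest point that is already labeled or selected, so
    far-away (poorly covered) locations are preferred, discounted by cost.
    """
    if not labeled:
        raise ValueError("need at least one already-labeled point")
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be positive")

    covered = list(labeled)          # points that already have labels
    available = set(range(len(candidates)))
    remaining = budget
    chosen = []

    while True:
        best, best_score = None, 0.0
        for i in sorted(available):
            if costs[i] > remaining:
                continue             # cannot afford this candidate
            gain = min(math.dist(candidates[i], p) for p in covered)
            score = gain / costs[i]  # gain per unit of budget spent
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break                    # nothing affordable or nothing useful left
        chosen.append(best)
        covered.append(candidates[best])
        remaining -= costs[best]
        available.remove(best)

    return chosen, budget - remaining


# Toy setting: existing labels are clustered near the origin.
labeled = [(0.0, 0.0), (0.1, 0.1)]
candidates = [(0.2, 0.0), (5.0, 5.0), (10.0, 0.0), (5.0, 5.5)]
costs = [1, 2, 4, 2]                 # heterogeneous per-location costs
chosen, spent = greedy_budgeted_sample(candidates, costs, labeled, budget=6)
print(chosen, spent)
```

In this toy setting the heuristic skips the cheap candidate next to the existing cluster and spends the budget on distant locations instead. A real pipeline would replace the distance proxy with a measure tied to model performance.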
Livia Betti
University of Colorado Boulder
Farooq Sanni
Togo Data Lab
Gnouyaro Sogoyou
Togo Data Lab
Togbe Agbagla
Togo Data Lab
Cullen Molitor
Center for Effective Global Action
Tamma Carleton
Center for Effective Global Action, University of California, Berkeley
Esther Rolf
Assistant Professor, CU Boulder
machine learning