🤖 AI Summary
Training Earth observation (EO) models on cloud platforms suffers severe GPU underutilization (0–30%) because PyTorch's default data loader becomes I/O-bound when reading GeoTIFFs from cloud storage. To address this, the paper proposes a systematic optimization framework for streaming GeoTIFF loading: Bayesian optimization automatically tunes data-loading hyperparameters, while a tile-aligned I/O scheduler coupled with a dynamic thread pool enables high-throughput, low-latency direct access to cloud storage. The authors customize PyTorch's DataLoader to support dual-path I/O (Azure Blob Storage and local SSD) and evaluate the framework on three EO benchmark tasks. Results show a 20× improvement in cloud-storage throughput, sustained GPU utilization of 85–95%, model accuracy matching local-SSD training, and validation IoU gains of 6–15%.
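The core of the tile-aligned I/O idea is that GeoTIFFs store pixels in fixed-size internal tiles, so snapping each read window to the tile grid lets every range request fetch whole compressed tiles instead of partial ones. A minimal sketch of that alignment step (illustrative only; the function name, signature, and default tile size are assumptions, not the paper's actual scheduler):

```python
def tile_aligned_window(col_off, row_off, width, height, tile_size=512):
    """Expand a requested pixel window to the smallest enclosing
    tile-aligned window.

    Aligning reads to the GeoTIFF's internal tile grid means each
    range request to cloud object storage covers whole compressed
    tiles, avoiding wasted bandwidth on partial-tile reads.
    """
    col0 = (col_off // tile_size) * tile_size          # snap left edge down
    row0 = (row_off // tile_size) * tile_size          # snap top edge down
    col1 = -(-(col_off + width) // tile_size) * tile_size   # ceil right edge up
    row1 = -(-(row_off + height) // tile_size) * tile_size  # ceil bottom edge up
    return col0, row0, col1 - col0, row1 - row0

# A 600x300 window starting at (100, 700) expands to cover the
# 512-pixel tiles it overlaps:
print(tile_aligned_window(100, 700, 600, 300))  # → (0, 512, 1024, 512)
```

In practice such a window would be passed to a windowed reader (e.g. rasterio's `window=` argument) so that each worker fetches exactly the tiles it needs.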
📝 Abstract
Training deep learning models on petabyte-scale Earth observation (EO) data requires separating compute resources from data storage. However, standard PyTorch data loaders cannot keep modern GPUs utilized when streaming GeoTIFF files directly from cloud storage. In this work, we benchmark GeoTIFF loading throughput from both cloud object storage and local SSD, systematically testing different loader configurations and data parameters. We focus on tile-aligned reads and worker thread pools, using Bayesian optimization to find optimal settings for each storage type. Our optimized configurations increase remote data-loading throughput by 20× and local throughput by 4× compared to default settings. On three public EO benchmarks, models trained with optimized remote loading achieve the same accuracy as local training within identical time budgets. We improve validation IoU by 6–15% and maintain 85–95% GPU utilization, versus 0–30% with standard configurations. Code is publicly available at https://github.com/microsoft/pytorch-cloud-geotiff-optimization
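The abstract describes finding per-storage-type optimal loader settings over a hyperparameter space. A minimal sketch of that tuning loop, using random search and a mock objective in place of the paper's Bayesian optimization and real throughput benchmark (the search space, parameter names, and `measure_throughput` are all illustrative assumptions, not the released code):

```python
import random

# Hypothetical search space over data-loading hyperparameters; the
# paper's actual parameters and ranges are not reproduced here.
SPACE = {
    "num_workers": [2, 4, 8, 16],
    "threads_per_worker": [1, 2, 4, 8],
    "tile_cache_mb": [64, 128, 256],
}

def measure_throughput(cfg):
    # Stand-in objective. In practice this would time real GeoTIFF
    # reads through the DataLoader against the target storage backend
    # and return MB/s.
    return cfg["num_workers"] * cfg["threads_per_worker"] / (1 + cfg["tile_cache_mb"] / 256)

def tune(n_trials=20, seed=0):
    """Random-search stand-in for Bayesian optimization: sample
    configurations, benchmark each, keep the best."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(vals) for k, vals in SPACE.items()}
        score = measure_throughput(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

A Bayesian optimizer (e.g. a Gaussian-process surrogate) would replace the uniform sampling with model-guided proposals, which matters when each throughput measurement is expensive; the surrounding loop structure is the same.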