🤖 AI Summary
Training Earth observation (EO) models on cloud platforms suffers severe GPU underutilization (0–30%) because PyTorch's default data loader becomes I/O-bound when reading GeoTIFFs from cloud storage. To address this, the paper proposes a systematic optimization framework for streaming GeoTIFF loading: Bayesian optimization automatically tunes data-loading hyperparameters, while a tile-aligned I/O scheduler coupled with a dynamic thread pool enables high-throughput, low-latency direct access to cloud storage. The authors customize PyTorch's DataLoader to support dual-path I/O (Azure Blob Storage and local SSD) and evaluate the framework on three EO benchmark tasks. Results show a 20× improvement in cloud-storage throughput, sustained GPU utilization of 85–95%, model accuracy matching local-SSD training, and validation IoU gains of 6–15%.
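The core of the tile-aligned I/O idea is that GeoTIFFs store pixels in fixed-size internal tiles, so snapping each read window to the tile grid lets every range request fetch whole compressed tiles instead of partial ones. A minimal sketch of that alignment step (illustrative only; the function name, signature, and default tile size are assumptions, not the paper's actual scheduler):

```python
def tile_aligned_window(col_off, row_off, width, height, tile_size=512):
    """Expand a requested pixel window to the smallest enclosing
    tile-aligned window.

    Aligning reads to the GeoTIFF's internal tile grid means each
    range request to cloud object storage covers whole compressed
    tiles, avoiding wasted bandwidth on partial-tile reads.
    """
    col0 = (col_off // tile_size) * tile_size          # snap left edge down
    row0 = (row_off // tile_size) * tile_size          # snap top edge down
    col1 = -(-(col_off + width) // tile_size) * tile_size   # ceil right edge up
    row1 = -(-(row_off + height) // tile_size) * tile_size  # ceil bottom edge up
    return col0, row0, col1 - col0, row1 - row0

# A 600x300 window starting at (100, 700) expands to cover the
# 512-pixel tiles it overlaps:
print(tile_aligned_window(100, 700, 600, 300))  # → (0, 512, 1024, 512)
```

In practice such a window would be passed to a windowed reader (e.g. rasterio's `window=` argument) so that each worker fetches exactly the tiles it needs.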
📝 Abstract
Training deep learning models on petabyte-scale Earth observation (EO) data requires separating compute resources from data storage. However, standard PyTorch data loaders cannot keep modern GPUs utilized when streaming GeoTIFF files directly from cloud storage. In this work, we benchmark GeoTIFF loading throughput from both cloud object storage and local SSD, systematically testing different loader configurations and data parameters. We focus on tile-aligned reads and worker thread pools, using Bayesian optimization to find optimal settings for each storage type. Our optimized configurations increase remote data-loading throughput by 20× and local throughput by 4× compared to default settings. On three public EO benchmarks, models trained with optimized remote loading achieve the same accuracy as local training within identical time budgets. We improve validation IoU by 6–15% and maintain 85–95% GPU utilization, versus 0–30% with standard configurations. Code is publicly available at https://github.com/microsoft/pytorch-cloud-geotiff-optimization
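The abstract describes finding per-storage-type optimal loader settings over a hyperparameter space. A minimal sketch of that tuning loop, using random search and a mock objective in place of the paper's Bayesian optimization and real throughput benchmark (the search space, parameter names, and `measure_throughput` are all illustrative assumptions, not the released code):

```python
import random

# Hypothetical search space over data-loading hyperparameters; the
# paper's actual parameters and ranges are not reproduced here.
SPACE = {
    "num_workers": [2, 4, 8, 16],
    "threads_per_worker": [1, 2, 4, 8],
    "tile_cache_mb": [64, 128, 256],
}

def measure_throughput(cfg):
    # Stand-in objective. In practice this would time real GeoTIFF
    # reads through the DataLoader against the target storage backend
    # and return MB/s.
    return cfg["num_workers"] * cfg["threads_per_worker"] / (1 + cfg["tile_cache_mb"] / 256)

def tune(n_trials=20, seed=0):
    """Random-search stand-in for Bayesian optimization: sample
    configurations, benchmark each, keep the best."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(vals) for k, vals in SPACE.items()}
        score = measure_throughput(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

A Bayesian optimizer (e.g. a Gaussian-process surrogate) would replace the uniform sampling with model-guided proposals, which matters when each throughput measurement is expensive; the surrounding loop structure is the same.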