RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models

📅 2025-12-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Remote sensing diffusion foundation models suffer from training data redundancy, noise contamination, and severe class imbalance; existing methods neglect the distributional requirements of generative modeling and the intrinsic heterogeneity of remote sensing imagery. To address this, we propose a novel, training-free, two-stage scene-aware data pruning framework: first performing coarse filtering via local information entropy, then conducting hierarchical clustering and representative sampling guided by remote sensing scene classification benchmarks. This work establishes the first “training-agnostic + scene-aware” pruning paradigm, effectively balancing fine-grained fidelity and global diversity. Under an aggressive 85% pruning ratio, our method significantly accelerates model convergence and improves generation quality. Extensive experiments demonstrate state-of-the-art performance across downstream tasks, including remote sensing image super-resolution and semantic image synthesis.

📝 Abstract
Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generative modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content and global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, using RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling, which improves clustering effectiveness while reducing computational cost on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
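The first stage described in the abstract, the entropy-based removal of low-information samples, might be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify how entropy is computed, so the sketch assumes 8-bit grayscale images at least as large as one tile, scores each image by the mean Shannon entropy of non-overlapping tiles, and uses hypothetical names (`shannon_entropy`, `local_entropy`, `entropy_filter`) and a hypothetical `threshold` parameter.

```python
import numpy as np

def shannon_entropy(gray, bins=256):
    """Shannon entropy (in bits) of the intensity histogram of a uint8 array."""
    hist = np.bincount(gray.ravel(), minlength=bins).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())

def local_entropy(gray, patch=16):
    """Mean entropy over non-overlapping patch x patch tiles (assumed scoring rule)."""
    h, w = gray.shape
    scores = [
        shannon_entropy(gray[i:i + patch, j:j + patch])
        for i in range(0, h - patch + 1, patch)
        for j in range(0, w - patch + 1, patch)
    ]
    return float(np.mean(scores))

def entropy_filter(images, threshold):
    """Return indices of images whose local entropy is at least `threshold`."""
    return [k for k, img in enumerate(images) if local_entropy(img) >= threshold]
```

Under these assumptions, a uniformly flat tile (for example open water or a saturated cloud patch) scores zero and is dropped, while textured scenes pass. In practice the threshold would have to be tuned on the target dataset.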
Problem

Research questions and friction points this paper is trying to address.

Prunes redundant remote sensing data for diffusion models
Addresses class imbalance and noise in training datasets
Enables efficient model convergence with high pruning ratios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free two-stage pruning for high ratios
Entropy filtering and scene-aware clustering selection
Balances cluster uniformity and sample representativeness
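The second stage listed above, scene-aware grouping followed by a selection that balances cluster uniformity against sample representativeness, could look roughly like the sketch below. Several parts are assumptions not stated in the source: features are taken to be precomputed embeddings, scene "prototypes" are taken to be mean embeddings of labeled reference scene classes, each cluster receives an equal quota, and representativeness is approximated by distance to the cluster centroid. The function names are hypothetical.

```python
import numpy as np

def assign_to_prototypes(features, prototypes):
    """Label each feature vector with the index of its nearest prototype (Euclidean)."""
    diff = features[:, None, :] - prototypes[None, :, :]
    return np.linalg.norm(diff, axis=2).argmin(axis=1)

def select_balanced(features, labels, budget):
    """Keep up to `budget` samples: equal quota per cluster, closest to centroid first."""
    clusters = np.unique(labels)
    quota = budget // len(clusters)  # uniform allocation across clusters (assumption)
    keep = []
    for c in clusters:
        idx = np.flatnonzero(labels == c)
        centroid = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - centroid, axis=1)
        keep.extend(idx[np.argsort(dist)[:quota]].tolist())
    return sorted(keep)
```

With an 85% pruning ratio, `budget` would be set to 15% of the samples that survive the entropy filter. A cluster smaller than its quota simply contributes all of its members, so the returned set can be smaller than `budget`.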
Fan Wei
Department of Mathematics, Princeton University
Analysis, Combinatorics, Probability
Runmin Dong
School of Artificial Intelligence, Sun Yat-sen University
Yushan Lai
Tsinghua Shenzhen International Graduate School, Tsinghua University
Yixiang Yang
Wangxuan Institute of Computer Technology, Peking University
Zhaoyang Luo
Tsinghua Shenzhen International Graduate School, Tsinghua University
Jinxiao Zhang
Department of Earth System Science, Tsinghua University
Miao Yang
Department of Earth System Science, Tsinghua University
Shuai Yuan
Department of Geography, The University of Hong Kong
Jiyao Zhao
National Supercomputing Center in Shenzhen
Bin Luo
Tsinghua Shenzhen International Graduate School, Tsinghua University
Haohuan Fu
Tsinghua University