🤖 AI Summary
To address the prohibitive computational and memory overhead of training neural networks on large-scale data, this paper proposes a streaming data subset selection method. Our approach constructs a compact gradient sketch via Frequent Directions to approximate the principal gradient subspace in constant memory; it then introduces a consistency scoring mechanism that prioritizes samples aligned with the dominant (consensus) directions of the sketch, avoiding pairwise similarity computations and explicit gradient storage. The method employs a two-pass, GPU-friendly pipeline enabling efficient online selection. Extensive experiments across multiple benchmarks show that retaining only 1–5% of the data achieves ≥98% of the full-dataset accuracy, while significantly reducing end-to-end computation and peak memory usage. To our knowledge, this is the first streaming subset selection framework for large models that provides deterministic approximation guarantees and constant-memory complexity.
📄 Abstract
Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
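To make the mechanism concrete, the following is a minimal sketch of the two ingredients named above: the standard Frequent Directions update, which keeps an $\ell \times D$ gradient sketch in constant memory, and a simplified consensus score that ranks examples by the alignment of their gradient with the sketch's top singular direction. The function and variable names (`fd_update`, `scores`, the synthetic gradients, the 5% keep rate) are illustrative assumptions, not SAGE's actual implementation.

```python
import numpy as np

def fd_update(B, g):
    """One Frequent Directions step (Liberty 2013): place gradient g into a
    zero row of the ell x D sketch B; when no zero row remains, shrink B via
    SVD so its smallest direction is zeroed out, freeing a row."""
    zero_rows = np.where(~B.any(axis=1))[0]
    if len(zero_rows) == 0:
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        # Shrink every singular value by the smallest one; the last row
        # of the rebuilt sketch becomes exactly zero.
        s = np.sqrt(np.maximum(s**2 - s[-1] ** 2, 0.0))
        B = s[:, None] * Vt
        zero_rows = np.where(~B.any(axis=1))[0]
    B[zero_rows[0]] = g
    return B

ell, D, N = 8, 64, 200
rng = np.random.default_rng(0)
# Synthetic per-example gradients with one dominant shared direction
# (stand-ins for real per-sample gradients).
grads = rng.normal(size=(N, D)) + 3.0 * rng.normal(size=(N, 1)) / np.sqrt(D)

B = np.zeros((ell, D))          # O(ell * D) memory, independent of N
for g in grads:                  # single streaming pass
    B = fd_update(B, g)

# Consensus direction: top right singular vector of the final sketch.
_, _, Vt = np.linalg.svd(B, full_matrices=False)
v1 = Vt[0]

# Score each example by |<g, v1>| and keep the top 5% (second pass).
scores = np.abs(grads @ v1)
keep = np.argsort(-scores)[: int(0.05 * N)]
```

Note that no $N \times N$ similarity matrix and no $N \times \ell$ gradient store ever materializes: the first pass touches each gradient once to update the sketch, and the second pass only computes one dot product per example.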