SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection

📅 2025-10-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the prohibitive computational and memory overhead of training neural networks on large-scale data, this paper proposes a streaming data-subset selection method. The approach constructs a compact gradient sketch via Frequent Directions to approximate the principal gradient subspace in constant memory; it then introduces a consistency scoring mechanism that prioritizes samples aligned with the dominant (consensus) directions of the sketch, avoiding pairwise similarity computations and explicit gradient storage. The method employs a two-pass, GPU-friendly pipeline for efficient online selection. Extensive experiments across multiple benchmarks show that retaining only 1–5% of the data achieves ≥98% of the full-dataset accuracy while significantly reducing end-to-end computation and peak memory usage. To the authors' knowledge, this is the first streaming subset-selection framework for large models that provides deterministic approximation guarantees and constant-memory complexity.
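As background, the Frequent Directions update the summary refers to can be sketched in a few lines of NumPy. This is a minimal illustration of the classic FD algorithm, not the paper's implementation; the class name `FDSketch` and the buffer layout are invented here:

```python
import numpy as np

class FDSketch:
    """Minimal Frequent Directions sketch: keeps an ell x D matrix B whose
    covariance approximates that of all streamed rows A, with the classic
    deterministic guarantee ||A^T A - B^T B||_2 <= ||A||_F^2 / ell,
    using only O(ell * D) memory."""

    def __init__(self, ell, dim):
        self.B = np.zeros((ell, dim))
        self.filled = 0  # number of occupied rows

    def update(self, g):
        if self.filled == self.B.shape[0]:
            # Sketch is full: rotate onto its singular directions and
            # shrink every squared singular value by the smallest one,
            # which zeroes the last row and frees space for g.
            _, s, Vt = np.linalg.svd(self.B, full_matrices=False)
            s = np.sqrt(np.maximum(s**2 - s[-1] ** 2, 0.0))
            self.B = s[:, None] * Vt
            self.filled = self.B.shape[0] - 1
        self.B[self.filled] = g
        self.filled += 1
```

Per-example gradients would be streamed through `update` one at a time, so peak memory never depends on the dataset size N.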

๐Ÿ“ Abstract
Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
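The abstract does not spell out the agreement score, so the following is one hypothetical reading: take the consensus direction to be the top right singular vector of the FD sketch $B$ and score each example's gradient by its cosine alignment with that direction. The function names `agreement_scores` and `select_subset` are illustrative, not the paper's API:

```python
import numpy as np

def agreement_scores(B, grads, eps=1e-12):
    """Cosine alignment of each gradient with the consensus direction,
    taken here (as an assumption) to be the sketch's top right singular
    vector. No N x N similarities or N x ell gradient store needed."""
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    consensus = Vt[0]  # dominant direction of the sketched subspace
    norms = np.linalg.norm(grads, axis=1) + eps
    return grads @ consensus / norms

def select_subset(B, grads, keep_rate=0.05):
    """Keep the top keep_rate fraction of examples by |agreement score|,
    mirroring the small kept-rate budgets described in the abstract."""
    scores = np.abs(agreement_scores(B, grads))
    k = max(1, int(keep_rate * len(grads)))
    return np.argsort(-scores)[:k]
```

In a two-pass pipeline, the first pass would build the sketch from streamed gradients and the second pass would score and select examples in mini-batches, so gradients never need to be materialized all at once.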
Problem

Research questions and friction points this paper is trying to address.

Selects representative data subsets for efficient neural network training
Reduces memory usage by eliminating pairwise gradient similarity computations
Maintains competitive accuracy while lowering computational and energy costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming gradient sketches for subset selection
Frequent Directions sketch in constant memory
GPU-friendly pipeline eliminating pairwise similarities