🤖 AI Summary
To address the prohibitive computational and memory overhead of training neural networks on large-scale data, this paper proposes a streaming data subset selection method. Our approach constructs a compact gradient sketch via Frequent Directions to approximate the principal gradient subspace in constant memory; it then introduces a consistency scoring mechanism that prioritizes samples aligned with the dominant (consensus) directions of the sketch, avoiding pairwise similarity computations and explicit gradient storage. The method employs a two-pass, GPU-friendly pipeline enabling efficient online selection. Extensive experiments across multiple benchmarks show that retaining only 1–5% of the data achieves ≥98% of the full-dataset accuracy, while significantly reducing end-to-end computation and peak memory usage. To our knowledge, this is the first streaming subset selection framework for large models that provides deterministic approximation guarantees and constant-memory complexity.
📄 Abstract
Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
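To make the mechanism concrete, the following is a minimal sketch of the two ingredients named above: the standard Frequent Directions update, which keeps an $\ell \times D$ gradient sketch in constant memory, and a simplified consensus score that ranks examples by the alignment of their gradient with the sketch's top singular direction. The function and variable names (`fd_update`, `scores`, the synthetic gradients, the 5% keep rate) are illustrative assumptions, not SAGE's actual implementation.

```python
import numpy as np

def fd_update(B, g):
    """One Frequent Directions step (Liberty 2013): place gradient g into a
    zero row of the ell x D sketch B; when no zero row remains, shrink B via
    SVD so its smallest direction is zeroed out, freeing a row."""
    zero_rows = np.where(~B.any(axis=1))[0]
    if len(zero_rows) == 0:
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        # Shrink every singular value by the smallest one; the last row
        # of the rebuilt sketch becomes exactly zero.
        s = np.sqrt(np.maximum(s**2 - s[-1] ** 2, 0.0))
        B = s[:, None] * Vt
        zero_rows = np.where(~B.any(axis=1))[0]
    B[zero_rows[0]] = g
    return B

ell, D, N = 8, 64, 200
rng = np.random.default_rng(0)
# Synthetic per-example gradients with one dominant shared direction
# (stand-ins for real per-sample gradients).
grads = rng.normal(size=(N, D)) + 3.0 * rng.normal(size=(N, 1)) / np.sqrt(D)

B = np.zeros((ell, D))          # O(ell * D) memory, independent of N
for g in grads:                  # single streaming pass
    B = fd_update(B, g)

# Consensus direction: top right singular vector of the final sketch.
_, _, Vt = np.linalg.svd(B, full_matrices=False)
v1 = Vt[0]

# Score each example by |<g, v1>| and keep the top 5% (second pass).
scores = np.abs(grads @ v1)
keep = np.argsort(-scores)[: int(0.05 * N)]
```

Note that no $N \times N$ similarity matrix and no $N \times \ell$ gradient store ever materializes: the first pass touches each gradient once to update the sketch, and the second pass only computes one dot product per example.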