AI Summary
This work investigates how to efficiently extract task-relevant information from gradient-based training on nonlinear tasks and compress it into synthetic data, thereby reducing both optimization and storage costs. By analyzing the data distillation process in two-layer neural networks under multi-index model tasks, the study establishes, for the first time, a theoretical framework for gradient-only data distillation that explicitly incorporates the intrinsic structure of nonlinear tasks. Leveraging the intrinsic dimensionality \( r \) of the task, the framework quantifies the achievable compression rate. Theoretical results demonstrate that distilled data can reproduce models with high generalization performance using memory complexity \( \tilde{\Theta}(r^2 d + L) \), where \( d \) is the input dimension and \( L \) is the network width, revealing that low-dimensional task structures can be efficiently encoded into synthetic data.
Abstract
Dataset distillation, a training-aware data compression technique, has recently attracted increasing attention as an effective tool for mitigating the costs of optimization and data storage. However, progress remains largely empirical. The mechanisms underlying the extraction of task-relevant information from the training process, and the efficient encoding of that information into synthetic data points, remain elusive. In this paper, we theoretically analyze practical dataset distillation algorithms applied to the gradient-based training of two-layer neural networks with width $L$. By focusing on a non-linear task structure called the multi-index model, we prove that the low-dimensional structure of the problem is efficiently encoded into the resulting distilled data. This dataset reproduces a model with high generalization ability at a memory complexity of $\tilde{\Theta}(r^2 d + L)$, where $d$ and $r$ are the input and intrinsic dimensions of the task. To the best of our knowledge, this is one of the first theoretical works that incorporate a specific task structure, leverage its intrinsic dimensionality to quantify the compression rate, and study dataset distillation implemented solely via gradient-based algorithms.
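To make the setting concrete, gradient matching is one common practical algorithm for gradient-based dataset distillation: a small synthetic set is adjusted until the gradient it induces on the network matches the gradient of the full dataset. The sketch below is a toy illustration only, not the paper's construction; the tanh activation, the single-index teacher (intrinsic dimension r = 1), the fixed synthetic labels, and the finite-difference update are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, n_real, n_syn = 20, 8, 200, 10

# Toy single-index teacher y = tanh(u . x): a multi-index model with r = 1.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
X_real = rng.normal(size=(n_real, d))
y_real = np.tanh(X_real @ u)

# Two-layer student f(x) = a . tanh(W x); only the first layer W is differentiated.
W = rng.normal(size=(L, d)) / np.sqrt(d)
a = rng.normal(size=L) / np.sqrt(L)

def grad_W(X, y):
    """Squared-loss gradient with respect to W, averaged over a batch."""
    h = np.tanh(X @ W.T)          # hidden activations, shape (n, L)
    err = h @ a - y               # residuals, shape (n,)
    # d/dW_l of 0.5*(f(x) - y)^2, averaged over the batch
    return ((err[:, None] * (1 - h**2) * a).T @ X) / len(y)

g_real = grad_W(X_real, y_real)   # target gradient from the full dataset

# Distillation: move a few synthetic inputs so their gradient matches g_real.
X_syn = rng.normal(size=(n_syn, d))
y_syn = np.tanh(X_syn @ u)        # synthetic labels held fixed for simplicity

def match_loss(Xs):
    diff = grad_W(Xs, y_syn) - g_real
    return 0.5 * np.sum(diff**2)

init_loss = match_loss(X_syn)
lr, eps = 0.02, 1e-6
for _ in range(100):
    base = match_loss(X_syn)
    G = np.zeros_like(X_syn)
    for i in range(n_syn):        # finite-difference gradient of the matching
        for j in range(d):        # loss w.r.t. each synthetic input entry
            Xp = X_syn.copy()
            Xp[i, j] += eps
            G[i, j] = (match_loss(Xp) - base) / eps
    X_syn -= lr * G
final_loss = match_loss(X_syn)
print(init_loss, final_loss)      # the matching loss should shrink
```

Here n_syn = 10 synthetic points stand in for n_real = 200 real ones; the memory bound in the abstract quantifies, in the theoretical setting, how small such a synthetic set can be while still reproducing a well-generalizing model.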