🤖 AI Summary
Existing dataset condensation (DC) methods align condensed data with full SGD optimization trajectories. These trajectories are noisy, have high curvature, and are memory-intensive to store, which hinders efficient, privacy-preserving modeling in clinical settings. This paper proposes a path-proxy method based on quadratic Bézier curves, presented as the first to incorporate mode connectivity into DC. It constructs a smooth, low-curvature, noise-free parameter path anchored at the initial and final model weights and achieves dynamic trajectory alignment via gradient matching, which removes the need to store the full training trajectory. The approach improves optimization stability and convergence speed. Evaluated on five clinical datasets, it surpasses state-of-the-art methods: synthetic data yield an average 3.2% improvement in downstream task performance, while memory consumption falls by 72% and training time by 41%.
📝 Abstract
Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real data with those of models trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic Bézier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify Bézier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
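To make the surrogate path concrete, the sketch below evaluates a standard quadratic Bézier curve between two weight vectors, θ(t) = (1 − t)² θ_start + 2t(1 − t) φ + t² θ_end. It is an illustration only, not the paper's implementation: the function names (`bezier_point`, `bezier_path`), the flat-list weight representation, and the fixed control point φ are assumptions made here. In mode-connectivity methods the control point is normally trained so that the loss stays low along the curve, and the abstract does not say how it is obtained.

```python
from typing import List, Sequence


def bezier_point(
    theta_start: Sequence[float],
    control: Sequence[float],
    theta_end: Sequence[float],
    t: float,
) -> List[float]:
    """Evaluate a quadratic Bezier curve in parameter space at t in [0, 1].

    theta(t) = (1 - t)^2 * theta_start + 2t(1 - t) * control + t^2 * theta_end
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if not (len(theta_start) == len(control) == len(theta_end)):
        raise ValueError("all weight vectors must have the same length")
    a = (1.0 - t) ** 2        # weight on the initial model state
    b = 2.0 * t * (1.0 - t)   # weight on the (hypothetical) control point
    c = t ** 2                # weight on the final model state
    return [a * s + b * m + c * e
            for s, m, e in zip(theta_start, control, theta_end)]


def bezier_path(
    theta_start: Sequence[float],
    control: Sequence[float],
    theta_end: Sequence[float],
    num_points: int,
) -> List[List[float]]:
    """Sample num_points evenly spaced points along the curve (endpoints included)."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    return [
        bezier_point(theta_start, control, theta_end, i / (num_points - 1))
        for i in range(num_points)
    ]


# Toy two-parameter "model"; the values are illustrative, not from the paper.
theta_start = [0.0, 0.0]   # initial weights
control = [1.0, 2.0]       # assumed control point (would be learned in practice)
theta_end = [2.0, 0.0]     # final weights after training on real data

midpoint = bezier_point(theta_start, control, theta_end, 0.5)
path = bezier_path(theta_start, control, theta_end, 5)
```

Only the two endpoints and the control point need to be kept, since any intermediate target can be recomputed from `t`. This is consistent with the abstract's claim that dense trajectory storage becomes unnecessary.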