🤖 AI Summary
This study investigates why small-population evolution strategies (ES) can effectively fine-tune large language models, and it examines the non-monotonic phenomenon in which reward first rises and then degrades during fine-tuning. Using weight-perturbation ES as a geometric probe, combined with a minimal quadratic stochastic-ascent model, the authors conduct empirical analyses on GSM8K, ARC-C, and WinoGrande across the Qwen2.5-Instruct series (0.5B–7B parameters). They find that the fine-tuning landscape is low-dimensional in curvature: optimization progresses primarily along a few high-curvature directions. This low effective curvature dimensionality jointly explains the efficacy of small populations and the non-monotonic training dynamics, challenging worst-case pessimism about high-dimensional zeroth-order optimization. Remarkably, reward-improving updates remain accessible with only about 30 perturbation samples, suggesting that high-dimensional fine-tuning admits a broader range of viable optimization methods.
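To make the mechanism concrete, here is a minimal sketch of the kind of small-population, weight-perturbation ES update the study relies on, applied to a toy quadratic reward. Everything here is assumed for illustration: the function `es_step`, the antithetic sampling and reward normalization (standard ES practice, not necessarily the paper's exact recipe), and all constants (`pop=30`, `sigma`, `lr`).

```python
# Minimal sketch of small-population weight-perturbation ES (illustrative;
# not the paper's exact implementation). Antithetic Gaussian perturbations
# of a flat parameter vector; the reward-weighted mean perturbation serves
# as a zeroth-order gradient estimate.
import numpy as np

rng = np.random.default_rng(0)

def es_step(theta, reward_fn, pop=30, sigma=0.1, lr=0.02):
    """One ES update with `pop` perturbations (pop ~ 30 mirrors the
    small populations highlighted above)."""
    half = pop // 2
    eps = rng.standard_normal((half, theta.size))
    eps = np.concatenate([eps, -eps])                 # antithetic pairs
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_est = (rewards[:, None] * eps).mean(axis=0) / sigma
    return theta + lr * grad_est

# Toy stand-in for a fine-tuning reward: a quadratic peaked at theta = 1
# in 1,000 dimensions. For a language model, reward_fn would instead score
# the perturbed weights on the task.
theta = np.zeros(1_000)
toy_reward = lambda w: -np.sum((w - 1.0) ** 2)
print("initial reward:", toy_reward(theta))
for _ in range(200):
    theta = es_step(theta, toy_reward)
print("final reward:", toy_reward(theta))
```

Even with only 30 perturbations in 1,000 dimensions, the estimate stays correlated enough with the true gradient to improve the reward steadily; this accessibility of improving updates is what the study probes at billion-parameter scale.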
📝 Abstract
Weight-perturbation evolution strategies (ES) can fine-tune billion-parameter language models with surprisingly small populations (e.g., $N\!\approx\!30$), contradicting the classical curse-of-dimensionality intuition for zeroth-order methods. We also observe a second, seemingly separate phenomenon: under fixed hyperparameters, the stochastic fine-tuning reward often rises, peaks, and then degrades in both ES and GRPO. We argue that both effects reflect a shared geometric property of fine-tuning landscapes: they are low-dimensional in curvature. A small set of high-curvature directions dominates improvement, producing (i) heterogeneous time scales that yield rise-then-decay under fixed stochasticity, as captured by a minimal quadratic stochastic-ascent model, and (ii) degenerate improving updates, where many random perturbations share similar components along these directions. Using ES as a geometric probe on the fine-tuning reward landscapes of GSM8K, ARC-C, and WinoGrande across Qwen2.5-Instruct models (0.5B--7B), we show that reward-improving perturbations remain empirically accessible with small populations across scales. Together, these results reconcile ES scalability with non-monotonic training dynamics and suggest that high-dimensional fine-tuning may admit a broader class of viable optimization methods than worst-case theory implies.
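As a companion to point (i), the following is a minimal sketch of a quadratic stochastic-ascent model with heterogeneous curvature. It is an illustration under assumed constants (the dimension counts `d` and `k`, curvatures `lam`, step size `eta`, noise scale `sigma`), not the paper's exact formulation.

```python
# Minimal sketch (assumed constants) of stochastic ascent on a quadratic
# reward R(theta) = -1/2 * sum_i lam_i * theta_i^2 with a few stiff
# (high-curvature) directions and many nearly flat ones.
import numpy as np

rng = np.random.default_rng(0)

d, k = 1_000, 10                 # total directions, stiff directions
lam = np.full(d, 1e-3)           # bulk: nearly flat curvature
lam[:k] = 1.0                    # a few high-curvature directions
theta = np.zeros(d)
theta[:k] = 2.0                  # initial error lies along stiff directions
eta, sigma = 0.1, 0.5            # fixed step size and gradient-noise scale

def reward(x):
    return -0.5 * np.sum(lam * x ** 2)   # peak reward 0 at x = 0

for t in range(2001):
    grad = -lam * theta                  # exact ascent direction
    theta += eta * (grad + sigma * rng.standard_normal(d))
    if t % 400 == 0:
        print(f"step {t:4d}  reward {reward(theta):8.3f}")
```

Under fixed stochasticity, the printed reward rises quickly as the few stiff directions are corrected on their fast time scale, then drifts downward as injected noise accumulates in the many flat directions, reproducing the rise-then-decay shape described above.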