Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a longstanding challenge in zeroth-order optimization: achieving high sample efficiency and low-variance gradient estimation at the same time. The authors propose Coherent Coordinate Descent (CoCD), a deterministic method that uses block cyclic coordinate descent to reuse historical gradient information and employs large-step finite differences to implicitly smooth the loss landscape. CoCD maintains a global descent direction while requiring only a constant number of function queries per iteration. By treating stale gradients as an asset rather than discarding them, the method avoids the stability and efficiency limitations of conventional stochastic zeroth-order approaches. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) show that CoCD is more sample-efficient than Block Cyclic Coordinate Descent (BCCD), reaches better final loss and accuracy, and is more stable than randomized zeroth-order methods.
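
One standard way to see why a large finite-difference step can act as a smoother is the identity below. It is a textbook consequence of the fundamental theorem of calculus, not necessarily the derivation used in the paper, and the symbols $f$ (a continuously differentiable loss), $e_i$ (the $i$-th coordinate direction), and $h > 0$ (the step size) are introduced here for illustration only.

$$
\frac{f(x + h e_i) - f(x)}{h} \;=\; \frac{1}{h}\int_0^h \partial_i f(x + t e_i)\,dt
$$

The forward difference is therefore the average of the partial derivative over a segment of length $h$. A larger $h$ averages over a wider window and damps sharp local variation, typically at the cost of a larger bias relative to the pointwise derivative.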

📝 Abstract

Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with "warm starts," effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.
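
The idea of reusing stale gradients can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cocd_sketch`, its parameters, the choice of central differences, and the plain gradient step are all assumptions made for illustration. Each iteration refreshes only one block of a stored gradient estimate by finite differences, then steps along the full estimate, including entries that were computed at earlier iterates.

```python
import numpy as np


def cocd_sketch(f, x0, block_size=2, fd_step=1e-2, lr=0.1, iters=200):
    """Hypothetical sketch: cyclic block refresh of a persistent gradient estimate.

    Only `block_size` coordinates are re-estimated per iteration (central
    differences, an assumption), so the query cost per iteration is the
    constant 2 * block_size regardless of the dimension.
    """
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    g = np.zeros(n)  # persistent estimate; entries outside the current block are stale
    queries = 0
    start = 0
    for _ in range(iters):
        # Coordinates of the block refreshed in this iteration (cyclic order).
        block = [(start + j) % n for j in range(min(block_size, n))]
        for i in block:
            x_plus = x.copy()
            x_minus = x.copy()
            x_plus[i] += fd_step
            x_minus[i] -= fd_step
            g[i] = (f(x_plus) - f(x_minus)) / (2.0 * fd_step)
            queries += 2
        # Step along the whole estimate: fresh block plus stale entries.
        x -= lr * g
        start = (start + block_size) % n
    return x, queries


if __name__ == "__main__":
    # Toy separable quadratic, chosen only to exercise the sketch.
    curvatures = np.array([1.0, 2.0, 3.0, 4.0])

    def loss(v):
        return 0.5 * float(np.sum(curvatures * v**2))

    x_final, n_queries = cocd_sketch(loss, np.ones(4))
    print(loss(x_final), n_queries)
```

On this toy problem the stale entries still point downhill because the iterate moves only a little between refreshes. Whether that holds on a real network depends on the step size and block schedule, which is the kind of condition the paper's notion of gradient coherence formalizes.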

Problem

Research questions and friction points this paper is trying to address.

Zeroth-Order Optimization

Sample Efficiency

Gradient Variance

Finite Differences

Black-Box Optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Coherent Coordinate Descent

Zeroth-Order Optimization

Gradient Coherence