🤖 AI Summary
This work addresses three limitations of deep reinforcement learning for quantum circuit optimization: replay buffers that disregard the reliability of temporal-difference (TD) targets, the high cost of a full quantum-classical evaluation at every environment step in curriculum learning, and the discarding of noise-free trajectories when retraining under hardware noise. Centering on the replay buffer, the authors propose ReaPER+, an annealed prioritized replay rule that shifts from TD-error-driven prioritization early in training to reliability-aware sampling as value estimates mature; OptCRLQAS, which amortizes expensive evaluations over multiple architectural edits; and a lightweight replay-buffer transfer mechanism that requires no network-weight transfer. Evaluated on quantum compilation and quantum architecture search (QAS) tasks, the approach achieves 4–32× higher sample efficiency, cuts wall-clock time per episode by up to 67.5% on a 12-qubit problem, and, on 6-, 8-, and 12-qubit molecular ground-state energy tasks, reduces the steps needed to reach chemical accuracy by up to 85–90% and the final energy error by up to 90%.
📝 Abstract
Deep reinforcement learning (RL) for quantum circuit optimization faces three fundamental bottlenecks: replay buffers that ignore the reliability of temporal-difference (TD) targets, curriculum-based architecture search that triggers a full quantum-classical evaluation at every environment step, and the routine discarding of noiseless trajectories when retraining under hardware noise. We address all three by treating the replay buffer as a primary algorithmic lever for quantum optimization. We introduce ReaPER$+$, an annealed replay rule that transitions from TD-error-driven prioritization early in training to reliability-aware sampling as value estimates mature, achieving $4$–$32\times$ gains in sample efficiency over fixed PER, ReaPER, and uniform replay while consistently discovering more compact circuits across quantum compilation and quantum architecture search (QAS) benchmarks; validation on LunarLander-v3 confirms that the principle is domain-agnostic. Furthermore, we eliminate the quantum-classical evaluation bottleneck in curriculum RL by introducing OptCRLQAS, which amortizes expensive evaluations over multiple architectural edits, cutting wall-clock time per episode by up to $67.5\%$ on a 12-qubit optimization problem without degrading solution quality. Finally, we introduce a lightweight replay-buffer transfer scheme that warm-starts noisy-setting learning by reusing noiseless trajectories, without network-weight transfer or $\epsilon$-greedy pretraining. This reduces steps to chemical accuracy by up to $85$–$90\%$ and final energy error by up to $90\%$ over from-scratch baselines on 6-, 8-, and 12-qubit molecular tasks. Together, these results establish that experience storage, sampling, and transfer are decisive levers for scalable, noise-robust quantum circuit optimization.
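To make the idea of an annealed replay rule concrete, the following is a minimal sketch of how sampling probabilities could shift from TD-error-driven to reliability-aware over training. The abstract does not state the actual ReaPER$+$ formula, so everything here is an assumption for illustration: the linear schedule in `anneal_weight`, the multiplicative mixing in `annealed_probabilities`, the exponent `alpha`, and the per-transition `reliabilities` scores in $[0, 1]$ are all hypothetical.

```python
import random


def anneal_weight(step, total_steps):
    """Hypothetical linear schedule: 0 = pure TD-error priority, 1 = fully reliability-aware."""
    return min(1.0, max(0.0, step / total_steps))


def annealed_probabilities(td_errors, reliabilities, step, total_steps,
                           alpha=0.6, eps=1e-6):
    """Sampling probabilities that interpolate between |TD| and reliability * |TD|.

    This mixing rule is an illustrative assumption, not the rule from the paper.
    """
    lam = anneal_weight(step, total_steps)
    scores = []
    for td, rel in zip(td_errors, reliabilities):
        rel = min(1.0, max(0.0, rel))  # clamp reliability to [0, 1]
        # lam = 0 gives |td|; lam = 1 gives rel * |td|. eps keeps every score positive.
        mixed = abs(td) * ((1.0 - lam) + lam * rel) + eps
        scores.append(mixed ** alpha)
    total = sum(scores)
    return [s / total for s in scores]


def sample_batch(probs, batch_size, rng):
    """Draw buffer indices (with replacement) according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Two transitions with equal TD error but different (assumed) reliability.
rng = random.Random(0)
early = annealed_probabilities([1.0, 1.0], [1.0, 0.0], step=0, total_steps=100)
late = annealed_probabilities([1.0, 1.0], [1.0, 0.0], step=100, total_steps=100)
batch = sample_batch(late, batch_size=8, rng=rng)
```

Under these assumptions, the two transitions are sampled equally often at the start of training, while late in training the transition with the unreliable target is almost never drawn. A real implementation would also need a definition of target reliability and, typically, importance-sampling corrections, both of which are outside what the abstract specifies.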