🤖 AI Summary
Large language models (e.g., Granite-20B-Code, StarCoder) exhibit high error rates and limited error recovery when generating Qiskit quantum circuit code. Method: We propose a reinforcement learning–based reliability enhancement approach, fine-tuning a 32B-parameter model on a high-quality synthetic dataset using two preference optimization algorithms—GRPO and ORPO. Contribution/Results: On the Qiskit HumanEval benchmark, ORPO achieves 56.29% Pass@1, while GRPO attains 49.0%, substantially outperforming existing general-purpose baselines—particularly on foundational and medium-difficulty tasks. This work represents the first systematic application of ORPO to domain-specific quantum programming code generation, empirically validating its efficacy in this setting. It establishes a novel paradigm for improving automation reliability in quantum software development.
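ORPO's defining ingredient is an odds-ratio penalty added to the standard fine-tuning loss: the model is rewarded when the odds of generating a preferred completion exceed those of a dispreferred one. A minimal sketch of that term in plain Python, following the general ORPO formulation (the log-probability values are illustrative, not taken from the paper):

```python
import math

def orpo_odds_ratio_loss(logp_chosen: float, logp_rejected: float) -> float:
    """Odds-ratio term of the ORPO objective.

    logp_chosen / logp_rejected: average per-token log-probabilities the
    model assigns to the preferred and dispreferred completions.
    Returns -log sigmoid(log-odds(chosen) - log-odds(rejected)), which
    shrinks as the model favors the preferred completion.
    """
    def log_odds(logp: float) -> float:
        p = math.exp(logp)          # turn log-prob into probability
        return math.log(p / (1.0 - p))  # probability -> log-odds

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)

# The full ORPO loss adds this term (scaled by a weight lambda) to the
# usual negative log-likelihood on the chosen completion.
loss = orpo_odds_ratio_loss(logp_chosen=-0.5, logp_rejected=-2.0)
print(loss)
```

Because the penalty rides on top of ordinary supervised fine-tuning, ORPO needs no separate reference model, which is one practical reason it is attractive for domain-specific fine-tuning runs like the one described here.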
📝 Abstract
Quantum circuits must be error-resilient, yet LLMs like Granite-20B-Code and StarCoder often output flawed Qiskit code. We fine-tuned a 32B-parameter model with two RL methods, Group Relative Policy Optimization (GRPO) and Odds-Ratio Preference Optimization (ORPO), using a richly annotated synthetic dataset. On the Qiskit HumanEval benchmark, ORPO reaches 56.29% Pass@1 ($\approx +10$ pp over Granite-8B-QK) and GRPO reaches 49.0%, both beating all general-purpose baselines; on the original HumanEval they score 65.90% and 63.00%, respectively. GRPO excels on basic tasks (42/54), ORPO on intermediate ones (41/68), and neither solves the five advanced tasks, highlighting clear gains yet room for progress in AI-assisted quantum programming.
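Pass@1 figures like those above are conventionally computed with the unbiased pass@k estimator from the HumanEval methodology: for each task, generate n samples, count the c that pass the unit tests, and average the per-task estimates. A minimal sketch with hypothetical per-task counts (not the paper's data):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generated samples (c of which are correct)
    passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical benchmark run: (n, c) per task, 10 samples each.
results = [(10, 7), (10, 0), (10, 10), (10, 3)]
scores = [pass_at_k(n, c, k=1) for n, c in results]
print(sum(scores) / len(scores) * 100)  # mean Pass@1 as a percentage
```

For k=1 this reduces to the fraction of correct samples per task, averaged over tasks; the estimator form matters mainly for k > 1.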