🤖 AI Summary
This study investigates whether large language models (LLMs) can generate quantum solvers for scientific problems that not only execute successfully but also produce numerically accurate results. To answer this question, the authors propose Q-SAGE, a method that iteratively executes LLM-generated quantum scripts, compares their outputs against classical solvers, and automatically refines the scripts until they meet a prescribed accuracy threshold. Q-SAGE introduces the first iterative evaluation framework explicitly designed to enforce scientific correctness, revealing a shift in LLM failure modes from execution errors to subtle numerical inaccuracies. Experiments across five scientific problem classes and five LLMs demonstrate that iterative refinement substantially improves solution success rates, albeit at considerable computational cost; notably, when more capable models fail, they more often produce numerically inaccurate solutions than scripts that fail to execute.
📝 Abstract
Large Language Models (LLMs) show strong capabilities in code generation, motivating their use in automated quantum solver development. However, in quantum computing, successful execution of generated code is not sufficient: correctness depends on numerically accurate results, which are sensitive to non-trivial mappings, hybrid quantum-classical workflows, and algorithm-specific approximations. This work introduces Q-SAGE, an iterative methodology to evaluate LLMs' capability in generating quantum solvers for scientific problems. At each iteration, the methodology executes the script generated by the LLM, compares its result with that of a classical solver, and refines the script until the two results match within a tolerance threshold. We empirically evaluated the methodology on five families of scientific problems of varying complexity and five LLMs, both open-source and proprietary. The results show that iterative refinement substantially improves success rates but introduces significant computational overhead. Moreover, as model capability increases, failure modes shift from execution errors to numerical inaccuracies, highlighting the current limitations of LLM-based quantum software.
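The execute, compare, and refine loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `refine_until_accurate`, `Attempt`, and `fake_llm`, the feedback strings, the default tolerance, and the iteration cap are all hypothetical, and the "LLM" is a stub that returns three canned candidates (one that crashes, one that is inaccurate, one that is accurate). The numbers are arbitrary placeholders, not results from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    iteration: int
    status: str              # "execution_error", "numerical_mismatch", or "success"
    value: Optional[float]   # None when the candidate failed to execute
    detail: str


def refine_until_accurate(
    generate: Callable[[str], Callable[[], float]],
    reference: float,
    tolerance: float = 1e-3,   # assumed threshold, not taken from the paper
    max_iterations: int = 5,   # assumed cap, not taken from the paper
) -> list:
    """Run candidates until one matches the classical reference within tolerance."""
    feedback = ""
    history = []
    for i in range(1, max_iterations + 1):
        candidate = generate(feedback)  # stands in for prompting the LLM
        try:
            value = candidate()         # stands in for executing the script
        except Exception as exc:
            feedback = f"execution error: {exc!r}"
            history.append(Attempt(i, "execution_error", None, feedback))
            continue
        error = abs(value - reference)
        if error <= tolerance:
            history.append(Attempt(i, "success", value, f"abs error {error:.2e}"))
            break
        feedback = f"result {value} differs from reference {reference} by {error:.2e}"
        history.append(Attempt(i, "numerical_mismatch", value, feedback))
    return history


def fake_llm() -> Callable[[str], Callable[[], float]]:
    """Hypothetical stub: ignores feedback and replays three canned candidates."""
    scripts = iter([
        lambda: 1 / 0,    # raises ZeroDivisionError: an execution failure
        lambda: 1.5,      # runs, but is far from the reference
        lambda: 2.0004,   # runs and lies within the tolerance
    ])
    return lambda feedback: next(scripts)


history = refine_until_accurate(fake_llm(), reference=2.0)
for attempt in history:
    print(attempt.iteration, attempt.status, attempt.detail)
```

The sketch separates the two failure modes the abstract distinguishes: a candidate that raises an exception is recorded as an execution error, while one that runs but misses the reference is recorded as a numerical mismatch. A real pipeline would replace the stub with an LLM call and would run generated scripts in an isolated process rather than in the evaluator's own interpreter.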