DualSchool: How Reliable are LLMs for Optimization Education?

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work evaluates the reliability of large language models (LLMs) on a foundational pedagogical task in operations research: converting linear programming primal problems to their duals (P2DC). To address the high false-positive rates and the absence of formal verification in existing evaluation methods, the authors propose an automatic verification approach based on *canonical graph edit distance* (CGED), integrated within a comprehensive assessment framework supporting instance generation, rigorous correctness validation, and fine-grained error attribution. Their systematic evaluation, the first of its kind, reveals that even mainstream open-weight LLMs commit frequent, fundamental errors on minimal two-variable instances. Moreover, the models exhibit significant fragility on auxiliary tasks such as correctness judgment and error localization. These findings challenge the prevailing assumption that LLMs are reliable optimization teaching aids, and provide both theoretical grounding and empirical benchmarks for trustworthy LLM deployment in educational contexts.

📝 Abstract
Consider the following task taught in introductory optimization courses which addresses challenges articulated by the community at the intersection of (generative) AI and OR: generate the dual of a linear program. LLMs, being trained at web-scale, have the conversion process and many instances of Primal to Dual Conversion (P2DC) at their disposal. Students may thus reasonably expect that LLMs would perform well on the P2DC task. To assess this expectation, this paper introduces DualSchool, a comprehensive framework for generating and verifying P2DC instances. The verification procedure of DualSchool uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and negatives when applied to P2DC. Experiments performed by DualSchool reveal interesting findings. Although LLMs can recite the conversion procedure accurately, state-of-the-art open LLMs fail to consistently produce correct duals. This finding holds even for the smallest two-variable instances and for derivative tasks such as correctness verification and error classification. The paper also discusses the implications for educators, students, and the development of large reasoning systems.
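For readers unfamiliar with the P2DC task the paper studies, a minimal sketch follows. It assumes the primal is given in symmetric form (max c^T x s.t. Ax ≤ b, x ≥ 0), where the dual is obtained by transposing the constraint matrix and swapping the objective with the right-hand side. This is only an illustration of the textbook rule, not the paper's instance generator, and the two-variable example below is a standard textbook instance chosen here for illustration.

```python
def dual_of(c, A, b):
    """Return (b, A_T, c), the data of the dual LP
        min  b^T y   s.t.  A^T y >= c,  y >= 0
    for a primal LP in symmetric form
        max  c^T x   s.t.  A x <= b,   x >= 0.
    """
    A_T = [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]
    return b, A_T, c

# A classic two-variable instance:
#   max 3x1 + 5x2  s.t.  x1 <= 4,  2x2 <= 12,  3x1 + 2x2 <= 18,  x >= 0
c = [3, 5]
A = [[1, 0], [0, 2], [3, 2]]
b = [4, 12, 18]

d_obj, d_A, d_rhs = dual_of(c, A, b)
# Dual: min 4y1 + 12y2 + 18y3  s.t.  y1 + 3y3 >= 3,  2y2 + 2y3 >= 5,  y >= 0

# Weak duality: any primal-feasible x and dual-feasible y satisfy c^T x <= b^T y.
x = [2.0, 6.0]        # primal-feasible (in fact optimal for this instance)
y = [0.0, 1.5, 1.0]   # dual-feasible (also optimal for this instance)
primal_val = sum(ci * xi for ci, xi in zip(c, x))    # 36.0
dual_val = sum(bi * yi for bi, yi in zip(d_obj, y))  # 36.0
assert primal_val <= dual_val
```

Equality of the two objective values at the chosen points reflects strong duality; the paper's benchmark asks LLMs to perform exactly this kind of conversion and then verifies the result structurally rather than numerically.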
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM reliability for linear program dual generation
Evaluating LLM performance on Primal to Dual Conversion tasks
Identifying LLM limitations in optimization education applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

DualSchool framework for P2DC instance generation and verification
Canonical Graph Edit Distance based automatic verification
Systematic evaluation of open LLMs on dual-conversion accuracy
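To see why a canonicalization-based check (as in the paper's CGED verification) is needed at all: the same dual can be written with constraints reordered, scaled, or with inequality directions flipped, so naive string or coefficient comparison produces false negatives. The sketch below is a toy stand-in, not the paper's CGED algorithm; it normalizes each constraint and counts set differences as a crude distance.

```python
from fractions import Fraction

def canonical(constraints):
    """Canonicalize linear constraints, each given as (coeffs, sense, rhs)
    with sense in {"<=", ">="}:
      1. flip every ">=" row to "<=" by negating it,
      2. scale each row so its first nonzero coefficient has magnitude 1,
      3. sort rows so constraint ordering does not matter.
    Toy normalization for illustration only -- not the paper's CGED.
    """
    rows = []
    for coeffs, sense, rhs in constraints:
        coeffs = [Fraction(v) for v in coeffs]
        rhs = Fraction(rhs)
        if sense == ">=":
            coeffs, rhs = [-v for v in coeffs], -rhs
        pivot = next((v for v in coeffs if v != 0), Fraction(1))
        scale = abs(pivot)
        rows.append((tuple(v / scale for v in coeffs), rhs / scale))
    return sorted(rows)

def crude_distance(lp1, lp2):
    """Symmetric-difference count between canonical constraint sets --
    a crude stand-in for an edit distance over canonical forms."""
    s1, s2 = set(canonical(lp1)), set(canonical(lp2))
    return len(s1 ^ s2)

# Two syntactically different writings of the same constraint set:
m1 = [([1, 3], ">=", 3), ([0, 2], ">=", 5)]    # y1 + 3y2 >= 3,  2y2 >= 5
m2 = [([0, 1], ">=", Fraction(5, 2)),          # y2 >= 5/2       (scaled)
      ([-2, -6], "<=", -6)]                    # -2y1 - 6y2 <= -6 (negated, scaled)
assert crude_distance(m1, m2) == 0

# A genuinely different model is detected:
m3 = [([1, 3], ">=", 4), ([0, 2], ">=", 5)]
assert crude_distance(m1, m3) > 0
```

The paper's CGED operates on a graph representation of the canonicalized model, which additionally makes the check invariant to variable renaming; the exact-arithmetic `Fraction` type is used here so that scaling never introduces floating-point mismatches.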
Michael Klamkin
Georgia Institute of Technology, AI4OPT
machine learning, constrained optimization
Arnaud Deza
Georgia Institute of Technology, AI4OPT
Combinatorial Optimization, Machine Learning, Large Language Models
Sikai Cheng
NSF AI Institute for Advances in Optimization, Georgia Institute of Technology, Atlanta, GA, USA
Haoruo Zhao
NSF AI Institute for Advances in Optimization, Georgia Institute of Technology, Atlanta, GA, USA
Pascal Van Hentenryck
NSF AI Institute for Advances in Optimization, Georgia Institute of Technology, Atlanta, GA, USA