ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the “feasibility–correctness gap” in large language models (LLMs) generating optimization code—where outputs are syntactically executable but semantically incorrect—by introducing a structured four-stage chain-of-thought reasoning framework and an unsupervised behavioral verification mechanism. The reasoning framework emulates expert modeling practices through phased deliberation, while the verification mechanism detects semantic errors via solver parameter perturbation without requiring ground-truth labels, further enabling execution recovery through Irreducible Inconsistent Subsystem (IIS) diagnosis. Evaluated across five state-of-the-art LLMs and three benchmarks, the approach consistently improves performance, boosting the top model’s correctness rate from 22.6% to 31.1% and achieving a 100% execution success rate, up from 72.1%. The study also releases the RetailOpt-190 dataset to support future research.
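The Irreducible Inconsistent Subsystem (IIS) diagnosis mentioned above can be illustrated with the classic deletion filter on a toy system of one-variable bound constraints. This is a hypothetical sketch for intuition only, not the paper's implementation: the constraint encoding, the `feasible` helper, and the brute-force filter are all illustrative assumptions; real solvers (e.g. via an IIS routine) do this on full LP/MIP models.

```python
# Hypothetical sketch of IIS extraction via the classic deletion filter.
# Constraints are bounds on one variable x: ("ge", b) means x >= b,
# ("le", b) means x <= b. This encoding is illustrative only.

def feasible(constraints):
    """A system of bounds on a single variable x is feasible iff
    the largest lower bound does not exceed the smallest upper bound."""
    lo = max((b for s, b in constraints if s == "ge"), default=float("-inf"))
    hi = min((b for s, b in constraints if s == "le"), default=float("inf"))
    return lo <= hi

def deletion_filter(constraints):
    """Return an irreducible infeasible subsystem (IIS): drop each
    constraint in turn; if the rest stays infeasible, the constraint is
    redundant for the conflict and is discarded, otherwise it is kept."""
    assert not feasible(constraints), "system must be infeasible"
    iis = list(constraints)
    i = 0
    while i < len(iis):
        trial = iis[:i] + iis[i + 1:]
        if not feasible(trial):
            iis = trial   # constraint i is not needed for infeasibility
        else:
            i += 1        # constraint i belongs to the IIS
    return iis

# Example: x >= 3 and x <= 2 conflict; x >= 0 is irrelevant to the conflict.
system = [("ge", 0.0), ("ge", 3.0), ("le", 2.0)]
print(deletion_filter(system))   # -> [('ge', 3.0), ('le', 2.0)]
```

The returned pair pinpoints the minimal conflicting constraints, which is the kind of localized signal an execution-recovery step can feed back to the model.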

📝 Abstract
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations, creating a feasibility-correctness gap of up to 90 percentage points on compositional problems. We introduce ReLoop, addressing silent failures from two complementary directions. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify) that mirrors expert modeling practice, with explicit variable-type reasoning and self-verification to prevent formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation, without requiring ground truth -- an external semantic signal that bypasses the self-consistency problem inherent in LLM-based code review. The two mechanisms are complementary: structured generation dominates on complex compositional problems, while behavioral verification becomes the largest single contributor on problems with localized formulation defects. Together with execution recovery via IIS-enhanced diagnostics, ReLoop raises correctness from 22.6% to 31.1% and execution from 72.1% to 100.0% on the strongest model, with consistent gains across five models spanning three paradigms (foundation, SFT, RL) and three benchmarks. We additionally release RetailOpt-190, 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.
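The perturbation idea behind behavioral verification can be sketched on a toy example: treat a generated formulation as a black box from parameters to optimal objective, nudge one parameter, and check that the objective moves with the sign the problem semantics demand. Everything below is an illustrative assumption, not the paper's procedure; the two model functions and the checker name are hypothetical.

```python
# Toy sketch of behavioral verification via parameter perturbation.
# A "formulation" is a black-box map from parameters to objective value;
# we bump one parameter and test that the objective responds with the
# expected sign, without needing any ground-truth solution.

def correct_profit(price, cost, qty):
    return (price - cost) * qty          # intended semantics

def buggy_profit(price, cost, qty):
    return (price + cost) * qty          # silent sign error: still executes

def passes_perturbation_check(model, params, name, expected_sign, eps=1e-3):
    """Perturb params[name] upward by eps and verify the objective
    changes with expected_sign (+1 should rise, -1 should fall);
    a zero change is tolerated."""
    base = model(**params)
    bumped = dict(params, **{name: params[name] + eps})
    delta = model(**bumped) - base
    return delta * expected_sign >= -1e-12

params = {"price": 10.0, "cost": 2.0, "qty": 5.0}
# Raising unit cost must not raise profit, so expected_sign = -1 for "cost".
print(passes_perturbation_check(correct_profit, params, "cost", -1))  # True
print(passes_perturbation_check(buggy_profit, params, "cost", -1))    # False
```

Both candidate formulations execute and return a number, so execution alone cannot separate them; the directional check flags the sign error, which is the gap an external behavioral signal is meant to close.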
Problem

Research questions and friction points this paper is trying to address.

silent failures
feasibility-correctness gap
LLM-based optimization
semantic correctness
compositional problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured generation
behavioral verification
silent failure
LLM-based optimization
parameter perturbation
Junbo Jacob Lian
Northwestern University, Wenzhou Buyi Pharmacy Chain Co., Ltd.
Yujun Sun
Northwestern University
Huiling Chen
Wenzhou University
Chaoyu Zhang
City University of Hong Kong
Chung-Piaw Teo
NUS
Operations Optimization