When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) genuinely understand constraint programming modelling or merely reproduce surface patterns from their training data. To assess robustness to linguistic variation, the authors design natural-language rewrites of classic problems that preserve semantics while perturbing contextual cues. Using systematic prompt engineering, cross-model comparison, and executable validation, they find that mainstream LLMs generate syntactically correct models yet degrade markedly under minor contextual rephrasing, revealing pronounced lexical sensitivity and modelling fragility. The key contribution is the first application of controlled linguistic perturbations to evaluate constraint modelling competence; the empirical evidence suggests that success on canonical formulations stems largely from training data contamination rather than robust reasoning, establishing a diagnostic framework for assessing LLM reliability in formal, logic-based tasks.

📝 Abstract
One of the long-standing goals in optimisation and constraint programming is to describe a problem in natural language and automatically obtain an executable, efficient model. Large language models appear to bring this vision closer, showing impressive results in automatically generating models for classical benchmarks. However, much of this apparent success may derive from data contamination rather than genuine reasoning: many standard CP problems are likely included in the training data of these models. To examine this hypothesis, we systematically rephrased and perturbed a set of well-known CSPLib problems to preserve their structure while modifying their context and introducing misleading elements. We then compared the models produced by three representative LLMs across original and modified descriptions. Our qualitative analysis shows that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation, revealing shallow understanding and sensitivity to wording.
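To make the target of the pipeline concrete: the paper's benchmarks are classic CSPLib problems, and the LLMs are asked to turn a natural-language description into an executable model. As a hedged illustration (not taken from the paper, which does not publish its generated code), here is the kind of small executable model in question, sketched in plain Python for the classic n-Queens problem, with a permutation encoding that builds in the all-different column constraint:

```python
from itertools import permutations

def n_queens_solutions(n):
    """Enumerate n-Queens solutions as column assignments per row.

    Iterating over permutations of columns already enforces the
    'all columns distinct' constraint; only the two diagonal
    constraints remain to be checked explicitly.
    """
    solutions = []
    for cols in permutations(range(n)):
        # A placement is valid if no two queens share a diagonal,
        # i.e. all r + c values and all r - c values are distinct.
        if len({r + c for r, c in enumerate(cols)}) == n and \
           len({r - c for r, c in enumerate(cols)}) == n:
            solutions.append(cols)
    return solutions

# The 8-Queens instance has 92 solutions, a standard sanity check.
print(len(n_queens_solutions(8)))  # 92
```

A real CP model would instead declare decision variables and post `all_different` and diagonal constraints to a solver; the brute-force sketch above only illustrates what "syntactically valid and semantically correct" means as an executable target.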
Problem

Research questions and friction points this paper is trying to address.

Measuring LLM sensitivity to wording changes in constraint programming problems
Testing whether LLM success derives from data contamination rather than genuine reasoning
Assessing the performance drop when problem context and phrasing are varied
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically rephrased and perturbed CSPLib problems while preserving their structure
Compared models produced by three representative LLMs on original and modified descriptions
Revealed pronounced wording sensitivity and shallow semantic understanding
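The executable-validation step can be sketched as follows. This is a hypothetical harness, not the paper's code: two generated models are treated as agreeing on an instance if they enumerate exactly the same solution set, illustrated here with two hand-written encodings of the CSPLib All-Interval Series problem (prob007).

```python
from itertools import permutations

# Two independently written encodings of the All-Interval Series
# problem: find permutations of 0..n-1 whose successive absolute
# differences are pairwise distinct. (Illustrative encodings, not
# the paper's LLM-generated models.)

def model_a(n):
    # Encoding 1: the n-1 absolute differences must all be distinct.
    return [p for p in permutations(range(n))
            if len({abs(p[i + 1] - p[i]) for i in range(n - 1)}) == n - 1]

def model_b(n):
    # Encoding 2: the sorted differences must be exactly 1..n-1.
    return [p for p in permutations(range(n))
            if sorted(abs(p[i + 1] - p[i]) for i in range(n - 1))
               == list(range(1, n))]

def same_solution_set(m_a, m_b, n):
    """Executable validation: two models agree on an instance
    if they produce identical solution sets."""
    return set(m_a(n)) == set(m_b(n))

print(same_solution_set(model_a, model_b, 5))  # True
```

Comparing full solution sets, rather than a single returned solution, catches models that are subtly over- or under-constrained by a perturbed problem description.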