🤖 AI Summary
Large language models (LLMs) are increasingly applied to automated algorithm design, yet the assumptions they implicitly make when constructing evolutionary multi-objective optimization (EMO) benchmarks remain poorly understood. Method: We use systematic prompt engineering to have LLMs design full EMO evaluation frameworks on their own, covering test problems (ZDT/DTLZ/WFG), performance metrics (HV/IGD), algorithm selections (NSGA-II/MOEA/D/NSGA-III), and parameter configurations, and then analyze their default preferences. Contribution/Results: Our study reveals a strong convergence toward classical benchmark configurations: 92% of generated setups replicate established combinations, and novel paradigms are almost never proposed. This suggests that LLM-driven algorithm design is tightly coupled with domain-specific prior knowledge, which limits generalizability and reproducibility. The work offers a methodological caution for LLM-augmented optimization research and a basis for rethinking benchmark design principles in automated algorithm discovery.
📝 Abstract
When we manually design an evolutionary optimization algorithm, we implicitly or explicitly assume a set of target optimization problems. In automated algorithm design, the target optimization problems are usually specified explicitly. Recently, the use of large language models (LLMs) for the design of evolutionary multi-objective optimization (EMO) algorithms has been examined in some studies. In those studies, the target multi-objective problems are not always specified explicitly. It is well known in the EMO community that the performance evaluation results of EMO algorithms depend not only on the test problems but also on many other factors, such as the performance indicators, the reference point, the termination condition, and the population size. Thus, the EMO algorithms designed by LLMs are likely to depend on those factors as well. In this paper, we examine the implicit assumptions that LLMs make about the performance comparison of EMO algorithms. For this purpose, we ask LLMs to design a benchmarking experiment for EMO algorithms. Our experiments show that LLMs often suggest classical benchmark settings: performance examination of NSGA-II, MOEA/D, and NSGA-III on ZDT, DTLZ, and WFG using HV and IGD under the standard parameter specifications.
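To make the dependence on the reference point concrete, the following is a minimal sketch of a two-objective hypervolume (HV) computation for minimization problems. It is an illustration only, not code from the paper: the function name `hypervolume_2d` and the two toy solution sets are hypothetical, and the numbers were chosen purely to show that the same two sets can be ranked differently under different reference points.

```python
def hypervolume_2d(points, ref):
    """Hypervolume of a 2-objective minimization set w.r.t. reference point `ref`.

    Hypothetical helper for illustration; assumes both objectives are minimized.
    """
    # Only points that are strictly better than the reference point
    # in both objectives contribute to the hypervolume.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv = 0.0
    best_f2 = ref[1]  # lowest second-objective value seen so far
    for f1, f2 in pts:  # sweep in ascending order of the first objective
        if f2 < best_f2:  # dominated points add no area and are skipped
            hv += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv


# Toy solution sets (illustrative values, not experimental data).
set_a = [(0.0, 1.0), (1.0, 0.0)]  # two extreme points of a front
set_b = [(0.4, 0.4)]              # a single point near the knee

for ref in [(1.1, 1.1), (10.0, 10.0)]:
    hv_a = hypervolume_2d(set_a, ref)
    hv_b = hypervolume_2d(set_b, ref)
    winner = "A" if hv_a > hv_b else "B"
    print(f"ref={ref}: HV(A)={hv_a:.2f}, HV(B)={hv_b:.2f}, better set: {winner}")
```

With the reference point close to the front, set B obtains the larger HV; with the distant reference point, set A does. This is one way in which an unstated benchmark setting can change which algorithm appears better.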