Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing

📅 2025-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the generalization capability of large language model (LLM)-generated heuristics for the bin packing problem. Using a large-scale, heterogeneous benchmark suite, the authors propose the first rigorous cross-benchmark evaluation framework, incorporating multiple metrics (solution quality, stability, robustness) and heuristic categories (classical, modern, randomized). They further introduce a novel "winning-instance-driven" instance-space characterization method to precisely delineate each heuristic's performance boundaries. Empirical results show that most LLM-generated heuristics underperform simple classical heuristics (e.g., First-Fit), with advantages confined to narrow, specific subspaces of instances; moreover, the computational cost of LLM-based heuristic evolution substantially outweighs its marginal performance gains. The core contributions are: (1) establishing a new paradigm for evaluating generalization in LLM-derived heuristics, and (2) revealing fundamental limitations in current LLMs' generalization capacity for combinatorial optimization.
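The classical First-Fit baseline that the study compares against can be sketched in a few lines. This is a minimal illustrative implementation of the textbook First-Fit rule (not code from the paper): each item is placed into the first already-open bin with enough remaining capacity, and a new bin is opened only when none fits.

```python
def first_fit(items, capacity):
    """Classical First-Fit online bin packing.

    Places each item, in arrival order, into the first open bin
    that still has room; opens a new bin otherwise.
    Returns the list of bins, each a list of item sizes.
    """
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:  # no open bin had room
            bins.append([item])
    return bins


# Example: 9 items of total size 37 with bin capacity 10.
# The lower bound is ceil(37/10) = 4 bins; First-Fit uses 5 here.
packing = first_fit([5, 7, 5, 2, 4, 2, 5, 1, 6], capacity=10)
print(len(packing), packing)
```

Despite its simplicity, First-Fit is guaranteed to use at most roughly 1.7× the optimal number of bins, which is part of why the paper finds it a surprisingly hard baseline for evolved heuristics to beat across a broad instance space.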

📝 Abstract
Coupling Large Language Models (LLMs) with Evolutionary Algorithms has recently shown significant promise as a technique to design new heuristics that outperform existing methods, particularly in the field of combinatorial optimisation. An escalating arms race is both rapidly producing new heuristics and improving the efficiency of the processes evolving them. However, driven by the desire to quickly demonstrate the superiority of new approaches, evaluation of the new heuristics produced for a specific domain is often cursory: testing on very few datasets in which instances all belong to a specific class from the domain, and on few instances per class. Taking bin-packing as an example, to the best of our knowledge we conduct the first rigorous benchmarking study of new LLM-generated heuristics, comparing them to well-known existing heuristics across a large suite of benchmark instances using three performance metrics. For each heuristic, we then evolve new instances won by the heuristic and perform an instance space analysis to understand where in the feature space each heuristic performs well. We show that most of the LLM heuristics do not generalise well when evaluated across a broad range of benchmarks in contrast to existing simple heuristics, and suggest that any gains from generating very specialist heuristics that only work in small areas of the instance space need to be weighed carefully against the considerable cost of generating these heuristics.
Problem

Research questions and friction points this paper is trying to address.

Algorithm Performance
Combinatorial Optimization
Bin Packing Problem
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Evolutionary Algorithms
Combinatorial Optimization