HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for LLM-generated combinatorial optimization heuristics suffer from saturation on closed-ended questions, subjective evaluation, and insufficient methodological rigor. Method: We introduce HeuriGym, an agentic, execution-based benchmark for combinatorial optimization in which LLMs autonomously design heuristics, execute them in a sandboxed environment, and iteratively refine solutions through multi-turn reflection, across nine real-world problems spanning domains such as computer systems, logistics, and bioinformatics. We propose the Quality-Yield Index (QYI), a unified metric capturing both solution quality and pass rate, and establish a closed-loop "generate–execute–feedback–improve" agent paradigm. Results: Evaluating nine state-of-the-art LLMs on all nine problems, the highest observed QYI is 0.6 (relative to an expert baseline of 1.0), exposing systemic limitations in tool use, hierarchical planning, and adaptive reasoning. The benchmark is publicly released to advance LLMs toward scientific-grade combinatorial problem solving.
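The "generate–execute–feedback–improve" loop described above can be sketched as follows. This is a minimal illustration, not HeuriGym's actual API: the function names (`run_agent_loop`, `propose`, `execute`) and the result dictionary shape are hypothetical.

```python
# Hypothetical sketch of the closed-loop agent paradigm: an LLM proposes a
# heuristic, a sandbox executes it, and the execution log is fed back to
# guide the next proposal. All names here are illustrative.

def run_agent_loop(propose, execute, max_turns=3):
    """Iteratively refine a heuristic over at most `max_turns` rounds,
    keeping the best feasible (lowest-cost) solution seen so far."""
    feedback = None
    best = None
    for _ in range(max_turns):
        heuristic = propose(feedback)   # LLM drafts or revises a heuristic
        result = execute(heuristic)     # sandboxed run -> dict(ok, score, log)
        if result["ok"] and (best is None or result["score"] < best["score"]):
            best = result               # keep the best feasible solution
        feedback = result["log"]        # log drives the next reflection turn
    return best


def make_stubs():
    """Toy stand-ins: an 'LLM' whose successive attempts improve."""
    attempts = iter([
        {"ok": False, "score": None, "log": "runtime error"},
        {"ok": True, "score": 12.0, "log": "feasible, cost 12.0"},
        {"ok": True, "score": 9.5, "log": "feasible, cost 9.5"},
    ])

    def propose(feedback):
        return "heuristic-v-next"       # placeholder for generated code

    def execute(heuristic):
        return next(attempts)

    return propose, execute


propose, execute = make_stubs()
print(run_agent_loop(propose, execute)["score"])  # best feasible cost found
```

The key design point the benchmark stresses is that feedback comes from actual code execution rather than from a judge's subjective comparison.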

📝 Abstract
While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-generated heuristics for combinatorial optimization problems
Addressing limitations in current LLM evaluation methodologies
Introducing a metric (QYI) to quantify solution pass rate and quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework for LLM heuristic evaluation
Iterative refinement via code execution feedback
Quality-Yield Index metric for performance quantification
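A QYI-style metric could be sketched as below. The paper combines a pass rate ("yield") with solution quality relative to an expert baseline; the exact formula is not reproduced here, so the harmonic-mean combination and the cost-ratio quality definition are assumptions, and `quality_yield_index` is a hypothetical name.

```python
# Hedged sketch of a Quality-Yield Index (QYI)-style metric.
# Assumptions (not the paper's exact definition): quality of a feasible
# solution is expert_cost / model_cost capped at 1 (minimization), and
# yield and mean quality are combined via a harmonic mean.

def quality_yield_index(results, expert_costs):
    """results: per-problem solution costs, or None for a failed run.
    expert_costs: matching expert baseline costs (quality 1.0 = expert)."""
    passed = [(r, e) for r, e in zip(results, expert_costs) if r is not None]
    yield_rate = len(passed) / len(results)   # fraction of valid solutions
    if not passed:
        return 0.0
    quality = sum(min(e / r, 1.0) for r, e in passed) / len(passed)
    if yield_rate + quality == 0.0:
        return 0.0
    # Harmonic mean penalizes imbalance: high quality on few passes
    # (or many low-quality passes) cannot score well.
    return 2 * yield_rate * quality / (yield_rate + quality)


# One failure out of four runs, costs slightly above the expert's:
print(quality_yield_index([10.0, None, 12.0, 11.0], [9.0, 9.0, 9.0, 9.0]))
```

Under these assumptions, matching the expert on every problem gives a QYI of 1.0, which is consistent with the expert baseline of 1 quoted in the abstract.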