CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) lack standardized, formalized evaluation benchmarks for combinatorial mathematics. Method: We introduce CombiBench, the first Lean 4 formal benchmark for combinatorics, comprising 100 proof and fill-in-the-blank problems spanning primary-school to International Mathematical Olympiad (IMO) difficulty across more than ten combinatorial topics. We further propose Fine-Eval, the first automated framework enabling precise scoring of formalized fill-in-the-blank problems, and formalize all non-geometric IMO combinatorics problems since 2000. Contribution/Results: Experiments reveal severely limited zero-shot formal solving capability in current LLMs (at most 7 of 100 problems solved), underscoring the difficulty of formal combinatorial reasoning. CombiBench fills a critical gap in the field, providing a reproducible, extensible benchmark and evaluation infrastructure to advance research in formal mathematical reasoning.

📝 Abstract
Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive benchmark comprising 100 combinatorial problems, each formalized in Lean~4 and paired with its corresponding informal statement. The problem set covers a wide spectrum of difficulty levels, ranging from middle school to IMO and university level, and spans over ten combinatorial topics. CombiBench is suitable for testing IMO solving capabilities since it includes all IMO combinatorial problems since 2000 (except IMO 2004 P3, as its statement contains an image). Furthermore, we provide a comprehensive and standardized evaluation framework, dubbed Fine-Eval (for $\textbf{F}$ill-in-the-blank $\textbf{in}$ L$\textbf{e}$an Evaluation), for formal mathematics. It accommodates not only proof-based problems but also, for the first time, the evaluation of fill-in-the-blank questions. Using Fine-Eval as the evaluation method and Kimina Lean Server as the backend, we benchmark several LLMs on CombiBench and observe that their capabilities for formally solving combinatorial problems remain limited. Among all models tested (none of which has been trained for this particular task), Kimina-Prover attains the best results, solving 7 problems (out of 100) under both ``with solution'' and ``without solution'' scenarios. We open source the benchmark dataset alongside the code of the proposed evaluation method at https://github.com/MoonshotAI/CombiBench/.
Problem

Research questions and friction points this paper is trying to address.

LLMs lack formalized benchmarks for combinatorial mathematics
Evaluating formal solving of 100 diverse combinatorics problems
Assessing LLM performance in Lean 4 with the Fine-Eval framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neurosymbolic integration of LLMs with formal reasoning
CombiBench benchmark with 100 Lean~4 formalized problems
Fine-Eval framework for fill-in-the-blank evaluation
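The fill-in-the-blank format that Fine-Eval scores can be sketched in Lean 4 as follows. This is a hypothetical illustration only: the identifiers (`answer`, `subsets_of_five`) and the exact encoding are assumptions, not taken from the actual benchmark files.

```lean
import Mathlib

-- Hypothetical sketch of a CombiBench-style fill-in-the-blank problem.
-- Informal statement: "How many subsets does a 5-element set have?"
-- The model must replace the `sorry` in `answer` with a concrete value
-- and then prove the theorem that pins that value down.

def answer : ℕ := sorry  -- expected fill-in: 32 = 2 ^ 5

theorem subsets_of_five (s : Finset ℕ) (h : s.card = 5) :
    s.powerset.card = answer := by
  sorry
```

Under this kind of encoding, an automated checker can verify both parts separately: that the blank was filled with a well-typed term, and that the accompanying proof closes without `sorry`.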
Junqi Liu
Academy of Mathematics and Systems Science, University of Chinese Academy of Sciences
Xiaohan Lin
Sun Yat-sen University
Jonas Bayer
University of Cambridge
Yael Dillies
Stockholm University
Weijie Jiang
East China Normal University
Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Roman Soletskyi
Haiming Wang
Professor at the School of Information Science and Engineering, Southeast University
Yunzhou Xie
Imperial College London
Beibei Xiong
East China Normal University
Zhengfeng Yang
East China Normal University
Jujian Zhang
Lihong Zhi
Academy of Mathematics and Systems Science
Jia Li
Numina
Zhengying Liu
Moonshot AI