Benchmarking Large Language Models with Integer Sequence Generation Tasks

📅 2024-11-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of rigorous evaluation benchmarks for large language models’ (LLMs) mathematical reasoning and algorithmic code synthesis capabilities. We introduce OEISCode, the first code-generation benchmark specifically designed for the On-Line Encyclopedia of Integer Sequences (OEIS), requiring models to generate correct, efficient, and verifiable computational programs. Our contributions are threefold: (1) a novel automated lookup cheating detection mechanism combining dynamic execution verification with static pattern matching; (2) unified evaluation of code correctness, time efficiency, and honesty within a single framework; and (3) a structured task taxonomy derived directly from real OEIS data. Experiments demonstrate that the o1-series models significantly outperform leading models from OpenAI, Anthropic, Meta, and Google in both accuracy and cheating suppression, confirming OEISCode’s strong discriminative power and validity for assessing mathematical reasoning.

Technology Category

Application Category

📝 Abstract
This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Integer Sequences (OEIS), a widely-used resource for mathematical sequences. The benchmark is designed to evaluate both the correctness of the generated code and its computational efficiency. Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences. In order to ensure models do not exploit memorized sequence values, we introduce an automated cheating detection mechanism that flags the use of lookup tables and validated this automation against human cheating evaluations. This benchmark provides a meaningful challenge for current LLMs, offering insights into their mathematical reasoning and code writing capabilities, which can guide future research directions and model development in mathematical reasoning and code synthesis.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' mathematical reasoning via integer sequence generation tasks
Testing algorithmic code synthesis without lookup table usage
Assessing model performance on classical and recent OEIS sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark uses integer sequences from OEIS
Automated cheating detection prevents lookup tables
Evaluates models on easy and hard sequences
🔎 Similar Papers
No similar papers found.