BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models

📅 2025-09-28
🤖 AI Summary
Existing reasoning benchmarks suffer from training data contamination, causing models to memorize answers rather than demonstrate genuine reasoning capabilities. Method: We propose BeyondBench—the first contamination-free evaluation framework for language model reasoning. It dynamically generates foundational mathematical problems (spanning arithmetic to NP-hard tasks) via algorithmic synthesis, yielding a verifiable combinatorial space exceeding 10¹⁵ instances across 44 distinct reasoning categories. Evaluation employs formal proof-driven answer verification and multi-level difficulty stratification. Contribution/Results: Evaluated on 101 models, BeyondBench reveals critical limitations: state-of-the-art models—including Gemini-2.5-Pro and Llama-3.3-70B—achieve ≤57% accuracy on high-difficulty tasks, exposing severe deficits in complex reasoning. Crucially, integrating tool-augmented execution substantially improves performance. BeyondBench establishes a new paradigm for trustworthy, contamination-resistant reasoning assessment, enabling rigorous, scalable, and formally grounded model evaluation.

📝 Abstract
Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce BeyondBench, an evaluation framework that avoids this problem by using algorithmic problem generation. Unlike traditional benchmarks that risk contamination from internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than 10^15 unique instances, with solutions verified deterministically by mathematical proofs. We evaluated 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-Pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing accuracy declines of 16.81%, 28.05%, and 47.59% on the Hard Suite. Our leaderboard is publicly available at https://ctrl-gaurav.github.io/BeyondBench/
Problem

Research questions and friction points this paper is trying to address.

Evaluating language models without the risk of contamination from internet training data
Distinguishing genuine reasoning from memorization via algorithmically generated problems
Assessing model performance as task complexity scales from polynomial to exponential
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates algorithmic problems dynamically to avoid data contamination
Verifies solutions deterministically using mathematical proofs
Systematically covers 44 tasks across three difficulty levels
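The generator code itself is not included in this summary, but the described approach (algorithmic instance synthesis plus deterministic answer checking) can be sketched as follows. Function names, parameters, and the two example task types (multi-term arithmetic and subset-sum) are our own illustrative choices, not the paper's actual implementation:

```python
import random

def generate_arithmetic(rng: random.Random, n_terms: int = 4, max_val: int = 99):
    """Sample a fresh multi-term arithmetic problem; the ground-truth
    answer is computed algorithmically, never stored or scraped."""
    terms = [rng.randint(1, max_val) for _ in range(n_terms)]
    ops = [rng.choice("+-") for _ in range(n_terms - 1)]
    expr = str(terms[0]) + "".join(f" {o} {t}" for o, t in zip(ops, terms[1:]))
    return expr, eval(expr)  # deterministic ground truth for this instance

def generate_subset_sum(rng: random.Random, n: int = 8, max_val: int = 50):
    """Sample an NP-complete subset-sum instance with a planted solution,
    guaranteeing the instance is satisfiable."""
    numbers = [rng.randint(1, max_val) for _ in range(n)]
    planted = rng.sample(range(n), rng.randint(1, n))  # hidden certificate
    target = sum(numbers[i] for i in planted)
    return numbers, target

def verify_subset_sum(numbers, target, indices):
    """Deterministic checker: any distinct, in-range index set whose
    elements sum to target is accepted, not just the planted one."""
    return (len(set(indices)) == len(indices)
            and all(0 <= i < len(numbers) for i in indices)
            and sum(numbers[i] for i in indices) == target)

rng = random.Random(0)          # fresh seeds yield fresh, uncontaminated tests
expr, answer = generate_arithmetic(rng)
numbers, target = generate_subset_sum(rng)
```

Because instances are sampled from a combinatorial space rather than a fixed test set, each evaluation run can draw problems no model has seen during training, and verification needs no human-labeled answer key.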
Gaurav Srivastava
Graduate Student, Virginia Tech | Dell Technologies
Natural Language Processing · Large Language Models · Complex Reasoning · Small Language Models
Aafiya Hussain
Department of Computer Science, Virginia Tech, USA
Zhenyu Bi
Ph.D. Student, Virginia Tech
Natural Language Processing · Information Retrieval
Swastik Roy
Amazon AGI, USA
Priya Pitre
Department of Computer Science, Virginia Tech, USA
Meng Lu
Department of Computer Science, Virginia Tech, USA
Morteza Ziyadi
Amazon AGI, USA
Xuan Wang
Department of Computer Science, Virginia Tech, USA