🤖 AI Summary
Existing large language model (LLM) benchmarks predominantly emphasize accuracy while neglecting token decoding efficiency—a critical dimension affecting inference latency, computational cost, and energy consumption.
Method: We introduce OckBench, the first model- and hardware-agnostic benchmark that jointly evaluates accuracy and token efficiency. It pioneers token count as a core evaluation metric and establishes a unified framework for joint assessment across reasoning and programming tasks. We further propose the accuracy–efficiency Pareto frontier to quantify trade-offs.
Contribution/Results: Our analysis reveals up to several-fold efficiency disparities among leading models (e.g., GPT-4, Claude 3, Gemini) at comparable accuracy levels. We release an open-source evaluation platform to catalyze a paradigm shift toward efficient inference research. Empirical validation confirms token efficiency as a necessary and actionable evaluation dimension, distinct from, yet complementary to, traditional accuracy metrics.
📝 Abstract
Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks mainly focus on accuracy and output quality, and they ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we introduce OckBench, a model-agnostic and hardware-agnostic benchmark that evaluates both accuracy and token count for reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we uncover that many models with comparable accuracy differ widely in token consumption, revealing that efficiency variance is a neglected but significant axis of differentiation. We further demonstrate Pareto frontiers over the accuracy-efficiency plane and argue for an evaluation paradigm shift: we should no longer treat tokens as "free". OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning. Our benchmarks are available at https://ockbench.github.io/.
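The accuracy-efficiency Pareto frontier mentioned above can be computed directly: a model is on the frontier if no other model matches or exceeds its accuracy while using the same or fewer tokens (with at least one strict improvement). Below is a minimal sketch of that dominance check; the model names and scores are illustrative placeholders, not results from the paper.

```python
def pareto_frontier(models):
    """Return the names of models not dominated by any other model.

    A model dominates another if it has accuracy >= and tokens <=,
    and is strictly better on at least one of the two axes.
    """
    frontier = []
    for name, acc, tok in models:
        dominated = any(
            (a >= acc and t <= tok) and (a > acc or t < tok)
            for _, a, t in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (model, accuracy, mean output tokens) triples
models = [
    ("model-a", 0.82, 12000),
    ("model-b", 0.81, 4000),   # near model-a's accuracy at a third of the tokens
    ("model-c", 0.70, 9000),   # dominated by model-b: less accurate, more tokens
]
print(pareto_frontier(models))  # -> ['model-a', 'model-b']
```

Note that both model-a and model-b survive: neither dominates the other, since one trades tokens for accuracy. This is exactly the "comparable accuracy, wildly different token consumption" trade-off the benchmark is designed to surface.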