OckBench: Measuring the Efficiency of LLM Reasoning

📅 2025-11-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM) benchmarks predominantly emphasize accuracy while neglecting token decoding efficiency, a critical dimension affecting inference latency, computational cost, and energy consumption. Method: We introduce OckBench, the first model- and hardware-agnostic benchmark to jointly evaluate accuracy and token efficiency. It establishes token count as a core evaluation metric and provides a unified framework for joint assessment across reasoning and programming tasks. We further propose the accuracy–efficiency Pareto frontier to quantify the trade-off. Contribution/Results: Our analysis reveals up to several-fold efficiency disparities among leading models (e.g., GPT-4, Claude 3, Gemini) at comparable accuracy levels. We release an open-source evaluation platform to catalyze a paradigm shift toward efficient-inference research. Empirical validation confirms that token efficiency is a necessary and actionable evaluation dimension, distinct from yet complementary to traditional accuracy metrics.

📝 Abstract
Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks mainly focus on accuracy and output quality, and they ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we introduce OckBench, a model-agnostic and hardware-agnostic benchmark that evaluates both accuracy and token count for reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we uncover that many models with comparable accuracy differ wildly in token consumption, revealing that efficiency variance is a neglected but significant axis of differentiation. We further demonstrate Pareto frontiers over the accuracy-efficiency plane and argue for an evaluation paradigm shift: we should no longer treat tokens as "free" to multiply. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning. Our benchmarks are available at https://ockbench.github.io/.
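The accuracy–efficiency Pareto frontier the abstract describes is straightforward to compute: a model is on the frontier if no other model is at least as accurate while using no more tokens, with at least one of the two strictly better. A minimal sketch (the model names and scores below are hypothetical, not results from the paper):

```python
def pareto_frontier(models):
    """Return the models not dominated on (accuracy up, tokens down).

    models: list of (name, accuracy, avg_tokens) tuples.
    """
    frontier = []
    for name, acc, tok in models:
        # A model is dominated if some other entry is at least as good
        # on both axes and strictly better on at least one.
        dominated = any(
            a2 >= acc and t2 <= tok and (a2 > acc or t2 < tok)
            for _, a2, t2 in models
        )
        if not dominated:
            frontier.append((name, acc, tok))
    return frontier

# Hypothetical scores: similar accuracy, very different token budgets.
models = [
    ("model_a", 0.82, 12_000),
    ("model_b", 0.81, 45_000),  # dominated: less accurate AND more tokens
    ("model_c", 0.88, 60_000),
]
print(pareto_frontier(models))
# → [('model_a', 0.82, 12000), ('model_c', 0.88, 60000)]
```

This is the sense in which two models with comparable accuracy can still be sharply distinguished: the one that reaches that accuracy with fewer decoded tokens dominates the other.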
Problem

Research questions and friction points this paper is trying to address.

Benchmarks ignore decoding token efficiency in reasoning tasks
OckBench evaluates both accuracy and token count for reasoning
Models with similar accuracy vary significantly in token consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a benchmark measuring token efficiency
Evaluates both accuracy and token consumption
Provides unified platform for efficiency research
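Evaluating accuracy and token consumption jointly, as the bullets above describe, could be sketched as a simple harness that records both quantities per task. This is an illustrative sketch only; the `model_fn` interface and task format are assumptions, not OckBench's actual API:

```python
from dataclasses import dataclass


@dataclass
class Result:
    correct: bool
    tokens: int


def evaluate(model_fn, tasks):
    """Score a model on both axes at once.

    model_fn(prompt) -> (answer, token_count)   # assumed interface
    tasks: list of (prompt, gold_answer) pairs
    Returns (accuracy, average tokens decoded per task).
    """
    results = []
    for prompt, gold in tasks:
        answer, n_tokens = model_fn(prompt)
        results.append(Result(answer == gold, n_tokens))
    accuracy = sum(r.correct for r in results) / len(results)
    avg_tokens = sum(r.tokens for r in results) / len(results)
    return accuracy, avg_tokens
```

The point of returning the pair rather than accuracy alone is that each model becomes a point on the accuracy-efficiency plane, which is exactly what a Pareto-frontier comparison needs.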
Zheng Du
Georgia Institute of Technology
Hao Kang
Georgia Institute of Technology
Song Han
Massachusetts Institute of Technology
Tushar Krishna
Associate Professor, Georgia Tech
Computer Architecture, Interconnection Networks, Network-on-Chip, Deep Learning Accelerators
Ligeng Zhu
Nvidia
Machine Learning, Efficient Deep Learning