FEval-TTC: Fair Evaluation Protocol for Test-Time Compute

📅 2025-11-02
🤖 AI Summary
Temporal fluctuations in LLM performance and API pricing render existing test-time compute (TTC) evaluation results non-reproducible and difficult to compare across studies. To address this, we propose the first fair evaluation protocol specifically designed for TTC. Our approach introduces three key innovations: (1) a unified chain-of-thought reasoning evaluation framework enabling standardized benchmarking across diverse models (e.g., GPT-4, Claude, Llama) and datasets (e.g., GSM8K, CommonsenseQA); (2) fixed-format prompt templates and a robust answer extraction pipeline that eliminate variability from prompt engineering; and (3) a fine-grained, token-level cost modeling mechanism that concurrently tracks computational overhead (in tokens) and monetary cost (in USD). Experiments on mathematical and commonsense reasoning tasks demonstrate substantially improved evaluation consistency and cross-method comparability. The implementation is publicly available.
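To make the "fixed-format prompt template plus answer extraction" idea concrete, here is a minimal sketch of how such a pipeline might look. This is an illustration only, not the actual FEval-TTC implementation: the marker phrase and regex fallback are assumptions chosen for common CoT output styles (e.g. "The answer is 18.").

```python
import re

def extract_final_answer(cot_output: str):
    """Pull the final answer out of a chain-of-thought completion.

    Strategy (hypothetical, for illustration):
    1. Look for a fixed-format marker such as "the answer is 18".
    2. Fall back to the last number appearing anywhere in the text.
    Returns the answer as a string, or None if nothing was found.
    """
    # Step 1: fixed-format marker produced by the prompt template.
    m = re.search(r"answer is\s*:?\s*(-?[\d,]*\.?\d+)",
                  cot_output, re.IGNORECASE)
    if m:
        return m.group(1).replace(",", "")
    # Step 2: fallback -- take the last number in the completion.
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", cot_output)
    return numbers[-1].replace(",", "") if numbers else None
```

Standardizing this step matters because two TTC methods scored with different extraction heuristics can appear to differ in accuracy when they only differ in parsing.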

Technology Category

Application Category

📝 Abstract
The performance of Large Language Models (LLMs) and the associated dollar costs of API calls can fluctuate over time, potentially invalidating conclusions drawn in prior research. To address this, we propose a Fair Evaluation protocol for Test-Time Compute (FEval-TTC), designed to ensure consistent assessment of test-time compute (TTC) methods, regardless of such fluctuations. FEval-TTC focuses on the evaluation of TTC methods that utilize underlying Chains-of-Thought (CoT). It supports evaluations across multiple LLMs on a diverse set of mathematical and commonsense reasoning datasets. The few-shot prompting and answer extraction processes are standardized across datasets, reducing both time and monetary overhead for researchers. Furthermore, we provide a cost modelling procedure that estimates both the token and dollar cost per query, facilitating equitable comparisons of prevalent TTC methods. We open-source FEval-TTC for public use at https://github.com/networkslab/feval_ttc.
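The per-query cost modelling the abstract describes can be sketched as a simple function of token counts and a decoupled price sheet. The structure below is an assumption for illustration (the rates and field names are hypothetical, not FEval-TTC's actual API); the point is that token counts and dollar prices are tracked separately, so results stay comparable even when providers change their pricing.

```python
from dataclasses import dataclass

@dataclass
class PriceSheet:
    """USD per 1M tokens. Rates here are placeholders: real API prices
    drift over time, which is the fluctuation FEval-TTC factors out."""
    input_usd_per_1m: float
    output_usd_per_1m: float

def query_cost(prompt_tokens: int, completion_tokens: int,
               prices: PriceSheet):
    """Return (total_tokens, dollar_cost) for a single LLM query."""
    dollars = (prompt_tokens * prices.input_usd_per_1m
               + completion_tokens * prices.output_usd_per_1m) / 1_000_000
    return prompt_tokens + completion_tokens, dollars
```

Reporting both numbers lets a TTC method be ranked by token overhead (provider-independent) and, separately, by dollars under any chosen price sheet.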
Problem

Research questions and friction points this paper is trying to address.

Ensures consistent evaluation of test-time compute methods
Standardizes few-shot prompting across diverse reasoning datasets
Provides cost modeling for equitable comparison of methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardizes few-shot prompting across diverse datasets
Models token and dollar costs for equitable comparisons
Open-sources evaluation protocol for test-time compute methods