FEval-TTC: Fair Evaluation Protocol for Test-Time Compute

📅 2025-11-02
🤖 AI Summary
Temporal fluctuations in LLM performance and API pricing render existing test-time compute (TTC) evaluation results non-reproducible and difficult to compare across studies. To address this, we propose the first fair evaluation protocol specifically designed for TTC. Our approach introduces three key innovations: (1) a unified chain-of-thought reasoning evaluation framework enabling standardized benchmarking across diverse models (e.g., GPT-4, Claude, Llama) and datasets (e.g., GSM8K, CommonsenseQA); (2) fixed-format prompt templates and a robust answer extraction pipeline that eliminate variability from prompt engineering; and (3) a fine-grained, token-level cost modeling mechanism that concurrently tracks computational overhead (in tokens) and monetary cost (in USD). Experiments on mathematical and commonsense reasoning tasks demonstrate substantially improved evaluation consistency and cross-method comparability. The implementation is publicly available.
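To make the "fixed-format prompt template plus answer extraction" idea concrete, here is a minimal sketch of how such a pipeline might look. This is an illustration only, not the actual FEval-TTC implementation: the marker phrase and regex fallback are assumptions chosen for common CoT output styles (e.g. "The answer is 18.").

```python
import re

def extract_final_answer(cot_output: str):
    """Pull the final answer out of a chain-of-thought completion.

    Strategy (hypothetical, for illustration):
    1. Look for a fixed-format marker such as "the answer is 18".
    2. Fall back to the last number appearing anywhere in the text.
    Returns the answer as a string, or None if nothing was found.
    """
    # Step 1: fixed-format marker produced by the prompt template.
    m = re.search(r"answer is\s*:?\s*(-?[\d,]*\.?\d+)",
                  cot_output, re.IGNORECASE)
    if m:
        return m.group(1).replace(",", "")
    # Step 2: fallback -- take the last number in the completion.
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", cot_output)
    return numbers[-1].replace(",", "") if numbers else None
```

Standardizing this step matters because two TTC methods scored with different extraction heuristics can appear to differ in accuracy when they only differ in parsing.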

Technology Category

Application Category

📝 Abstract
The performance of Large Language Models (LLMs) and the associated dollar costs of API calls can fluctuate over time, potentially invalidating conclusions drawn in prior research. To address this, we propose a Fair Evaluation protocol for Test-Time Compute (FEval-TTC), designed to ensure consistent assessment of test-time compute (TTC) methods, regardless of such fluctuations. FEval-TTC focuses on the evaluation of TTC methods that utilize underlying Chains-of-Thought (CoT). It supports evaluations across multiple LLMs on a diverse set of mathematical and commonsense reasoning datasets. The few-shot prompting and answer extraction processes are standardized across datasets, reducing both time and monetary overhead for researchers. Furthermore, we provide a cost modelling procedure that estimates both the token and dollar cost per query, facilitating equitable comparisons of prevalent TTC methods. We open-source FEval-TTC for public use at https://github.com/networkslab/feval_ttc.
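The per-query cost modelling the abstract describes can be sketched as a simple function of token counts and a decoupled price sheet. The structure below is an assumption for illustration (the rates and field names are hypothetical, not FEval-TTC's actual API); the point is that token counts and dollar prices are tracked separately, so results stay comparable even when providers change their pricing.

```python
from dataclasses import dataclass

@dataclass
class PriceSheet:
    """USD per 1M tokens. Rates here are placeholders: real API prices
    drift over time, which is the fluctuation FEval-TTC factors out."""
    input_usd_per_1m: float
    output_usd_per_1m: float

def query_cost(prompt_tokens: int, completion_tokens: int,
               prices: PriceSheet):
    """Return (total_tokens, dollar_cost) for a single LLM query."""
    dollars = (prompt_tokens * prices.input_usd_per_1m
               + completion_tokens * prices.output_usd_per_1m) / 1_000_000
    return prompt_tokens + completion_tokens, dollars
```

Reporting both numbers lets a TTC method be ranked by token overhead (provider-independent) and, separately, by dollars under any chosen price sheet.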
Problem

Research questions and friction points this paper is trying to address.

Ensures consistent evaluation of test-time compute methods
Standardizes few-shot prompting across diverse reasoning datasets
Provides cost modeling for equitable comparison of methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardizes few-shot prompting across diverse datasets
Models token and dollar costs for equitable comparisons
Open-sources evaluation protocol for test-time compute methods