On Speeding Up Language Model Evaluation

📅 2024-07-08
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Evaluating large numbers of hyperparameter combinations in LLM prompt engineering is computationally expensive and time-consuming. Method: the paper proposes an adaptive sequential evaluation framework that combines Thompson sampling, a Bayesian multi-armed bandit strategy, with low-rank matrix factorization to model correlations in performance across prompt methods and validation samples. Because evaluation outcomes are highly correlated and the score matrix is sparsely completable, the framework dynamically prioritizes the most informative (method, sample) pairs for early assessment. Contribution/Results: on multiple benchmark tasks, the approach identifies the top-performing prompting strategy with high accuracy using only 5–15% of the standard evaluation budget, cutting LLM inference costs by 85–95%, and thereby substantially accelerates prompt development while maintaining reliable performance estimates.

📝 Abstract
Developing prompt-based methods with Large Language Models (LLMs) requires making numerous decisions, which give rise to a combinatorial search problem over hyper-parameters. This exhaustive evaluation can be time-consuming and costly. In this paper, we propose an *adaptive* approach to explore this space. We exploit the fact that often only a few samples are needed to identify clearly superior or inferior settings, and that many evaluation tests are highly correlated. We lean on multi-armed bandits to sequentially identify the next (method, validation sample)-pair to evaluate and utilize low-rank matrix factorization to fill in missing evaluations. We carefully assess the efficacy of our approach on several competitive benchmark problems and show that it can identify the top-performing method using only 5-15% of the typical resources -- resulting in 85-95% LLM cost savings. Our code is available at https://github.com/kilian-group/banditeval.
Problem

Research questions and friction points this paper is trying to address.

Speeding up language model evaluation
Reducing combinatorial search complexity
Minimizing LLM evaluation costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive hyper-parameter search method
Multi-armed bandits for sequential evaluation
Low-rank matrix factorization for missing data
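The adaptive loop behind these contributions, bandit-driven selection of (method, sample) pairs combined with low-rank completion of the unevaluated entries, can be sketched as follows. This is a minimal illustration on synthetic scores, not the authors' implementation (their code is at the repository linked in the abstract); the Gaussian posterior model, the rank, and the 15% budget below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a hidden (method x sample) score matrix.
# In a real run, reading an entry would cost one LLM evaluation call.
n_methods, n_samples = 8, 50
true_scores = (rng.uniform(0, 1, (n_methods, 1)) * 0.5
               + rng.uniform(0, 1, (n_methods, n_samples)) * 0.5)

observed = np.full((n_methods, n_samples), np.nan)

def complete_low_rank(M, rank=2, iters=50):
    """Fill NaNs via iterative SVD imputation (a simple stand-in for the
    paper's matrix-factorization step)."""
    mask = ~np.isnan(M)
    filled = np.where(mask, M, np.nanmean(M))
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(mask, M, low_rank)
    return filled

budget = int(0.15 * n_methods * n_samples)  # ~15% of exhaustive evaluation
for _ in range(budget):
    # Thompson sampling: draw a plausible mean score per method from a
    # Gaussian posterior over its observed entries; pick the best draw.
    draws = []
    for m in range(n_methods):
        obs = observed[m][~np.isnan(observed[m])]
        mu = obs.mean() if obs.size else 0.5
        sd = 1.0 / np.sqrt(obs.size + 1)
        draws.append(rng.normal(mu, sd))
    m = int(np.argmax(draws))
    # Evaluate one not-yet-seen sample for that method (one "LLM call").
    unseen = np.flatnonzero(np.isnan(observed[m]))
    if unseen.size == 0:
        continue
    j = rng.choice(unseen)
    observed[m, j] = true_scores[m, j]

# Complete the sparse score matrix and rank methods by estimated mean.
estimates = complete_low_rank(observed).mean(axis=1)
best = int(np.argmax(estimates))
```

The bandit concentrates evaluations on promising methods, while the completion step recovers estimates for the (method, sample) pairs that were never run, which is what allows ranking all methods from a small fraction of the full evaluation grid.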
Jin Peng Zhou
Cornell University
language model, theorem proving, recommender system
Christian K. Belardi
Cornell University
Ruihan Wu
University of California, San Diego
machine learning
Travis Zhang
Cornell University
Carla P. Gomes
Cornell University
Wen Sun
Cornell University
Kilian Q. Weinberger
Cornell University