Bayesian Evaluation of Large Language Model Behavior

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of statistical rigor in evaluating large language model (LLM) behavior, focusing on two safety-relevant challenges: propensity toward harmful outputs and sensitivity to adversarial inputs. The authors propose a Bayesian framework for quantifying uncertainty in binary evaluation metrics, explicitly modeling the variability induced by stochastic sampling strategies, to enable principled credible-interval estimation for metrics such as refusal rate and pairwise preference scores. The method combines Bayesian modeling, binary score aggregation, and uncertainty propagation, and is validated on an adversarial prompt benchmark and an open-ended dialogue preference benchmark. Experiments show that the framework improves the interpretability and reliability of LLM safety and preference evaluations, providing a more robust statistical foundation for assessing model behavior under uncertainty.
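The summary does not pin down the model itself. As a minimal sketch, a conjugate Beta-Binomial model is the standard Bayesian treatment of a binary metric such as refusal rate; the Beta(1, 1) prior, the function name, and the counts below are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch: Bayesian credible interval for a binary evaluation
# metric (e.g., refusal rate) via a conjugate Beta-Binomial model.
# The Beta(1, 1) prior is an illustrative assumption, not necessarily
# the paper's specification.
from scipy import stats

def binary_metric_interval(successes, trials, a=1.0, b=1.0, level=0.95):
    """Posterior mean and equal-tailed credible interval for the
    success probability under a Beta(a, b) prior."""
    post = stats.beta(a + successes, b + trials - successes)
    lo, hi = post.ppf([(1 - level) / 2, 1 - (1 - level) / 2])
    return post.mean(), (lo, hi)

# Hypothetical counts: 130 refusals observed on 200 adversarial prompts.
mean, (lo, hi) = binary_metric_interval(130, 200)
print(f"refusal rate: mean = {mean:.3f}, 95% CrI = ({lo:.3f}, {hi:.3f})")
```

With a flat prior this reduces to the Bayesian analogue of a binomial proportion interval, which is precisely the uncertainty statement that point-estimate-only evaluations omit.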

📝 Abstract
It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts provided to the LLM, where the output for each prompt may be assessed in a binary fashion (e.g., harmful/non-harmful or does not leak/leaks sensitive information), and the aggregation of binary scores is used to evaluate the LLM. However, existing approaches to evaluation often neglect statistical uncertainty quantification. With an applied statistics audience in mind, we provide background on LLM text generation and evaluation, and then describe a Bayesian approach for quantifying uncertainty in binary evaluation metrics. We focus in particular on uncertainty that is induced by the probabilistic text generation strategies typically deployed in LLM-based systems. We present two case studies applying this approach: 1) evaluating refusal rates on a benchmark of adversarial inputs designed to elicit harmful responses, and 2) evaluating pairwise preferences of one LLM over another on a benchmark of open-ended interactive dialogue examples. We demonstrate how the Bayesian approach can provide useful uncertainty quantification about the behavior of LLM-based systems.
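The abstract singles out uncertainty induced by probabilistic decoding: under temperature or nucleus sampling, the same prompt can produce different binary outcomes across repeated generations. One hedged way to capture this, sketched below, is to fit a Beta posterior to each prompt's repeated generations and propagate the per-prompt posteriors to the benchmark-level rate by Monte Carlo. The independent per-prompt priors and the counts are illustrative assumptions, not the authors' exact model.

```python
# Sketch: propagating per-prompt generation variability to a
# benchmark-level rate. Each prompt has k repeated generations scored
# 0/1; each prompt gets an independent Beta(1, 1)-prior posterior, and
# the benchmark rate (the average of the per-prompt probabilities) is
# approximated by Monte Carlo. All modeling choices are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# successes[i] = refusal count among k generations of prompt i (hypothetical)
successes = np.array([3, 5, 0, 4, 2, 5, 1, 5, 4, 3])
k = 5  # generations per prompt

# Joint draws from the independent per-prompt Beta posteriors.
draws = rng.beta(1 + successes, 1 + k - successes, size=(10_000, len(successes)))

# Benchmark-level refusal rate: average over prompts within each draw.
rate = draws.mean(axis=1)
lo, hi = np.quantile(rate, [0.025, 0.975])
print(f"benchmark rate: mean = {rate.mean():.3f}, 95% CrI = ({lo:.3f}, {hi:.3f})")
```

A hierarchical prior that shares strength across prompts would be a natural refinement; the independent-Beta version is kept deliberately simple to make the uncertainty-propagation step visible.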
Problem

Research questions and friction points this paper is trying to address.

Evaluating harmful output tendencies in large language models
Quantifying uncertainty in binary evaluation metrics statistically
Assessing model behavior sensitivity to adversarial input prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian approach quantifies uncertainty in binary metrics
Explicitly models uncertainty induced by probabilistic text generation
Case studies cover refusal rates and pairwise preference evaluation (see the sketch below)
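For the pairwise-preference case study, the same conjugate machinery yields not only a credible interval but a direct posterior probability that one model is preferred overall. The judgment counts and Beta(1, 1) prior below are hypothetical placeholders, not results from the paper.

```python
# Sketch: Bayesian pairwise preference of model A over model B.
# Each dialogue example contributes a binary judgment (1 = A preferred).
# The Beta(1, 1) prior and the counts are illustrative assumptions.
from scipy import stats

a_preferred, n_examples = 118, 200  # hypothetical judgment counts
post = stats.beta(1 + a_preferred, 1 + n_examples - a_preferred)

lo, hi = post.ppf([0.025, 0.975])
print(f"P(A preferred): mean = {post.mean():.3f}, 95% CrI = ({lo:.3f}, {hi:.3f})")

# Posterior probability that A is preferred more often than not,
# i.e., that the preference probability exceeds 0.5.
print(f"Pr(pref. prob > 0.5 | data) = {post.sf(0.5):.3f}")
```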
Rachel Longjohn
Department of Statistics, University of California, Irvine
Shang Wu
Unknown affiliation
Saatvik Kher
Department of Computer Science, University of California, Irvine
Catarina Belém
Department of Computer Science, University of California, Irvine
Padhraic Smyth
Distinguished Professor, Computer Science, University of California, Irvine
machine learning · artificial intelligence · pattern recognition · statistics