Predicting the Performance of Black-box LLMs through Self-Queries

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating the behavioral consistency, factual accuracy, and architectural characteristics of black-box large language model (LLM) APIs—without access to internals—remains challenging. Method: The paper proposes a lightweight performance-prediction framework that operates entirely in the black-box setting. It uses self-query prompting to elicit response probability distributions from the target LLM, extracts low-dimensional but highly discriminative behavioral features from those distributions, and trains a linear predictor to estimate instance-level output correctness. Contribution/Results: To the authors' knowledge, this is the first method to achieve accurate instance-level correctness prediction under purely black-box conditions, and it outperforms several white-box baselines. The extracted features also distinguish adversarially contaminated models from clean ones, identify architectural differences (e.g., GPT-3.5 vs. GPT-4o-mini) and parameter scales, and support factual verification, architecture inference, and contamination detection. The approach generalizes well across models and is inherently interpretable.

📝 Abstract
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. While a great deal of work in the field uses internal representations to interpret model behavior, these representations are inaccessible when given solely black-box access through an API. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance at the instance level (e.g., if a particular generation correctly answers a question). Remarkably, these can often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted features can be used to evaluate more nuanced aspects of a language model's state. For instance, they can be used to distinguish between a clean version of GPT-4o-mini and a version that has been influenced via an adversarial system prompt that answers question-answering tasks incorrectly or introduces bugs into generated code. Furthermore, they can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4o-mini).
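The core mechanism in the abstract—asking follow-up questions and treating the model's response probabilities as a feature vector—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the follow-up prompts, the `ask` interface, and the mock model are all hypothetical stand-ins for a real black-box API that exposes probabilities over a few candidate tokens (e.g., via top-k logprobs).

```python
# Hypothetical follow-up prompts; the paper's actual prompt set may differ.
FOLLOW_UPS = [
    "Are you confident in your previous answer? Answer Yes or No.",
    "Could your previous answer be incorrect? Answer Yes or No.",
    "Would you change your answer on reflection? Answer Yes or No.",
]

def extract_features(ask, question, answer):
    """Build a low-dimensional feature vector for one (question, answer) pair.

    `ask(prompt) -> {token: prob}` stands in for a black-box API call that
    returns probabilities over candidate responses; each follow-up prompt
    contributes P("Yes") as one feature.
    """
    features = []
    for follow_up in FOLLOW_UPS:
        prompt = f"Q: {question}\nA: {answer}\n{follow_up}"
        probs = ask(prompt)
        features.append(probs.get("Yes", 0.0))
    return features

# Toy stand-in for the API: a "confident" model with skewed Yes/No probabilities.
def mock_ask(prompt):
    if "Could your previous answer be incorrect" in prompt:
        return {"Yes": 0.1, "No": 0.9}
    return {"Yes": 0.8, "No": 0.2}

feats = extract_features(mock_ask, "What is 2+2?", "4")
print(feats)  # [0.8, 0.1, 0.8]
```

The resulting vector (here three numbers per generation) is what the paper's linear predictor consumes, which is why the method needs no hidden states or full vocabulary distributions.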
Problem

Research questions and friction points the paper addresses.

Black-box models accessible only through APIs
Evaluating model behavior and performance without internal access
Predicting instance-level correctness and detecting subtle behavioral changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box self-query feature extraction
Linear predictor of instance-level correctness
Model behavior analysis (contamination, architecture, and size detection)
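The "linear predictor" contribution amounts to fitting a simple classifier on the self-query features. The sketch below uses synthetic features and labels (the data, the three-feature layout, and the correctness rule are all invented for illustration); the paper trains on real follow-up probabilities paired with ground-truth correctness.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for self-query features: P("Yes") for 3 follow-up prompts.
X = rng.uniform(0, 1, size=(n, 3))
# Invented correctness rule: answers tend to be right when self-reported
# confidence (feature 0) is high and self-reported doubt (feature 1) is low.
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n) > 0).astype(int)

# Train the linear probe on 150 examples, evaluate on the held-out 50.
clf = LogisticRegression().fit(X[:150], y[:150])
acc = clf.score(X[150:], y[150:])
print(f"held-out accuracy: {acc:.2f}")
```

Because the probe is linear over a handful of features, its weights are directly inspectable, which is the source of the interpretability claim in the summary.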