Predicting Performance of Symbolic and Prompt Programs with Examples

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability of large language model prompting programs on novel tasks, which hinders reliable performance prediction from limited trials. The authors propose RAP (Retrieved Approximate Prior), a framework that models program success rates via a Bernoulli process and constructs an empirical prior from a diverse corpus of programs. By retrieving similar tasks to generate proxy priors, RAP significantly improves prediction accuracy. The study reveals that symbolic programs exhibit an “all-or-nothing” performance distribution, whereas prompting programs show more diffuse outcomes. RAP robustly distinguishes highly reliable symbolic programs from highly uncertain prompting programs across multiple tasks, enabling more trustworthy pre-deployment evaluation.

📝 Abstract

LLM prompting is widely used for naturally stated tasks, yet it is unreliable: it may succeed on a few test cases but fail at deployment time. We study performance prediction: given a program, either symbolic (e.g., Python) or a prompt executed on an LLM, and a few in-domain examples, predict its performance on unseen tasks from the same domain. We use a simple coin-flip model, treating each pass/fail program execution as a Bernoulli random variable whose success probability is the program's unknown performance. In this model, the predicted performance depends entirely on: 1) the observed execution outcomes on test cases, and 2) a prior over performances. We compile empirical performance priors from a corpus of diverse programs and tasks, and find that performance for symbolic programs (e.g., Python) is all-or-nothing, while prompt programs have a diffuse prior with many nearly-correct programs. This difference explains why a few passing tests can certify symbolic programs but not prompt programs. Building on this insight, we develop RAP (Retrieved Approximate Prior), which retrieves similar tasks and prompt programs from an existing corpus to construct a proxy prior, which is then used to predict performance. We show that RAP achieves solid performance.
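
The coin-flip model described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `posterior_mean`, the two toy priors, and the choice to weight every corpus entry equally are all assumptions made for illustration. The prior is represented as a plain list of performance values, as might be collected from a corpus (or, in the spirit of RAP, from retrieved similar programs), and the prediction is the posterior mean after observing some passes and fails.

```python
def posterior_mean(prior_perfs, passes, fails):
    """Posterior mean of a program's success probability.

    prior_perfs: performance values in [0, 1] taken from a corpus of
                 programs (hypothetical data; each entry is treated as
                 equally likely a priori).
    passes, fails: observed pass/fail counts on the test cases.
    """
    # Bernoulli likelihood of the observations under each candidate performance.
    weights = [p ** passes * (1 - p) ** fails for p in prior_perfs]
    total = sum(weights)
    if total == 0:
        raise ValueError("observations have zero likelihood under this prior")
    # Weighted average of the candidate performances.
    return sum(w * p for w, p in zip(weights, prior_perfs)) / total


# Toy priors, invented for illustration only.
symbolic_prior = [0.0] * 5 + [1.0] * 5  # all-or-nothing
prompt_prior = [0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.9, 0.95, 0.95, 1.0]  # diffuse

# Three passing tests, no failures.
symbolic_estimate = posterior_mean(symbolic_prior, passes=3, fails=0)
prompt_estimate = posterior_mean(prompt_prior, passes=3, fails=0)
print(symbolic_estimate)  # 1.0: passing tests rule out the zero-performance programs
print(prompt_estimate)    # stays below 1.0: nearly-correct prompts also pass often
```

Under the all-or-nothing toy prior, three passing tests leave only the perfect programs with nonzero weight, so the estimate is exactly 1. Under the diffuse toy prior, programs with performance 0.9 or 0.95 remain plausible after three passes, so the estimate stays below 1, which mirrors the abstract's point that a few passing tests can certify symbolic programs but not prompt programs.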

Problem

Research questions and friction points this paper is trying to address.

performance prediction

symbolic programs

prompt programs

large language models

empirical priors

Innovation

Methods, ideas, or system contributions that make the work stand out.

performance prediction

prompt programs

symbolic programs