🤖 AI Summary
Prompt optimization in compound AI systems is unreliable: it often scores below zero-shot prompting, and practitioners have no dependable way to tell before deployment whether it will help. This work reports a large-scale empirical study of 144 optimization runs and 18,000 grid evaluations, covering Claude Haiku (72 runs: 6 methods, 4 tasks, 3 repeats) and Amazon Nova Lite. On Claude Haiku, 49% of runs score below zero-shot. Gains of up to +6.8 points over zero-shot appear only when the task has exploitable output structure, that is, a format the model can produce but does not use by default. Interaction effects between agent prompts are never statistically significant (p > 0.52, all F < 1.0), which calls into question the assumption behind end-to-end tools such as TextGrad and DSPy that prompts must be optimized jointly. The paper proposes a two-stage diagnostic: an ANOVA pre-test for agent coupling (about 80 USD) and a 10-minute headroom test that predicts whether optimization is worthwhile, so that the decision to optimize is informed rather than blind.
📝 Abstract
Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy: (A) individual prompts are worth optimizing, and (B) agent prompts interact, requiring joint optimization. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure -- a format the model can produce but does not default to. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile -- turning a coin flip into an informed decision.
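To make the coupling pre-test concrete, the following is a minimal sketch of a two-way ANOVA interaction check on a grid of prompt variants for two agents. It is not the authors' implementation: the function name `interaction_f`, the 2 x 2 grid, and all scores are hypothetical and chosen only for illustration, and a balanced design (the same number of repeats in every cell) is assumed. The sketch computes only the $F$ statistic for the interaction term; an $F$ below 1 means the interaction varies less than the repeat-to-repeat noise, which corresponds to the "all $F < 1.0$" pattern reported above.

```python
from itertools import product
from statistics import mean


def interaction_f(scores):
    """F statistic for the A x B interaction in a balanced two-way ANOVA.

    scores[i][j] holds the replicate scores obtained with prompt variant i
    for agent A combined with prompt variant j for agent B.
    """
    a, b = len(scores), len(scores[0])
    n = len(scores[0][0])  # replicates per cell (balanced design assumed)

    cell = [[mean(scores[i][j]) for j in range(b)] for i in range(a)]
    row = [mean(cell[i]) for i in range(a)]
    col = [mean([cell[i][j] for i in range(a)]) for j in range(b)]
    grand = mean(row)

    # Interaction: what remains of each cell mean after removing both main effects.
    ss_inter = n * sum(
        (cell[i][j] - row[i] - col[j] + grand) ** 2
        for i, j in product(range(a), range(b))
    )
    # Error: spread of the replicates around their own cell mean.
    ss_error = sum(
        (x - cell[i][j]) ** 2
        for i, j in product(range(a), range(b))
        for x in scores[i][j]
    )
    df_inter = (a - 1) * (b - 1)
    df_error = a * b * (n - 1)
    return (ss_inter / df_inter) / (ss_error / df_error)


# Hypothetical scores: prompt effects add up, so the agents are not coupled.
additive = [
    [[70, 71, 72], [74, 75, 76]],
    [[72, 73, 74], [76, 77, 78]],
]
# Hypothetical scores: one specific prompt combination collapses (coupling).
coupled = [
    [[70, 71, 72], [74, 75, 76]],
    [[72, 73, 74], [60, 61, 62]],
]

f_additive = interaction_f(additive)  # 0.0
f_coupled = interaction_f(coupled)    # 192.0
```

Converting $F$ into a $p$-value requires the F distribution with `df_inter` and `df_error` degrees of freedom, which is not in the Python standard library; one option is `scipy.stats.f.sf`. If the interaction is not significant, each agent's prompt can be tuned on its own rather than jointly.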