PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Single-prompt evaluation leads to substantial performance fluctuations and unreliable assessments of large language models (LLMs), while manually crafting diverse prompts is labor-intensive and hinders the practical adoption of robust multi-prompt evaluation. To address this, the authors propose PromptSuite, a modular, extensible framework for automated multi-prompt generation. Its core idea is to decompose prompts into semantic components (e.g., instructions, examples, constraints) and enable controllable perturbations per component, including synonym substitution, structural rewriting, and template injection. The framework offers both a Python API and an interactive web interface, supporting plug-and-play deployment across tasks and benchmarks. Case studies demonstrate that PromptSuite generates semantically coherent and distributionally diverse prompt variants, improving evaluation stability. The source code and online tool are publicly released.
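The component-wise perturbation idea described above can be sketched in a few lines of Python. Note this is a minimal illustrative sketch, not PromptSuite's actual API: the `PromptTemplate` class, the `variants` method, and the toy synonym table are all assumptions made for illustration.

```python
import random

# Illustrative sketch only: PromptSuite's real API is not shown in this
# summary, so the names below are hypothetical. The idea: represent a
# prompt as named semantic components (instruction, constraints, ...)
# and apply a controlled perturbation to one component while keeping
# the others fixed.

# Toy synonym table for the synonym-substitution perturbation.
SYNONYMS = {
    "summarize": ["condense", "recap"],
    "briefly": ["concisely", "in short"],
}

def synonym_substitution(text, rng):
    """Replace words found in the synonym table with a random synonym."""
    out = []
    for word in text.split():
        key = word.lower().strip(".,")
        out.append(rng.choice(SYNONYMS[key]) if key in SYNONYMS else word)
    return " ".join(out)

class PromptTemplate:
    """A prompt decomposed into named semantic components."""
    def __init__(self, **components):
        self.components = components

    def variants(self, component, perturb, n, seed=0):
        """Yield n full prompts with one component perturbed, others fixed."""
        rng = random.Random(seed)  # seeded for reproducible variants
        for _ in range(n):
            parts = dict(self.components)
            parts[component] = perturb(parts[component], rng)
            yield "\n".join(parts.values())

prompt = PromptTemplate(
    instruction="Briefly summarize the article.",
    constraints="Answer in one sentence.",
)
for v in prompt.variants("instruction", synonym_substitution, n=3):
    print(v)
```

Keeping the untouched components fixed is what makes the perturbation "controlled": any change in model performance across variants can be attributed to the one component that varied.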

📝 Abstract
Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible, working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. It is available through both a Python API (https://github.com/eliyahabba/PromptSuite) and a user-friendly web interface (https://promptsuite.streamlit.app/).
Problem

Research questions and friction points this paper is trying to address.

Unreliable LLM evaluation with single prompts
Challenges in generating robust multi-prompt variations
Lack of flexible frameworks for automated prompt generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic generation of various prompts
Modular design for controlled perturbations
Extensible support for new components
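One common way to realize the "extensible" claim above is a registry that maps perturbation names to functions, so new perturbation types plug in without touching the framework core. This is a hypothetical sketch of that pattern; the registry, decorator, and function names are illustrative, not PromptSuite's actual API.

```python
# Hypothetical extensibility sketch: a name-to-function registry plus a
# decorator for adding new perturbation types. Not PromptSuite's real API.

PERTURBATIONS = {}

def register(name):
    """Decorator that adds a perturbation function to the registry."""
    def wrap(fn):
        PERTURBATIONS[name] = fn
        return fn
    return wrap

@register("uppercase_instruction")
def uppercase(text):
    # A trivial structural rewrite used as a stand-in for a real
    # perturbation such as template injection.
    return text.upper()

def perturb(text, name):
    """Look up a registered perturbation by name and apply it."""
    return PERTURBATIONS[name](text)

print(perturb("Summarize the article.", "uppercase_instruction"))
# → SUMMARIZE THE ARTICLE.
```

With this shape, a user-defined perturbation is just another decorated function; evaluation code can then iterate over `PERTURBATIONS` to generate the full variant set.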