PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Single-prompt evaluation leads to substantial performance fluctuations and unreliable assessments of large language models (LLMs), while manually crafting diverse prompts is labor-intensive and hinders the practical adoption of robust multi-prompt evaluation. To address this, the authors propose PromptSuite, a modular, extensible framework for automated multi-prompt generation. Its core idea is to decompose prompts into semantic components (e.g., instructions, examples, constraints) and enable controllable perturbations per component, including synonym substitution, structural rewriting, and template injection. The framework offers both a Python API and an interactive web interface, supporting plug-and-play deployment across tasks and benchmarks. Case studies demonstrate that PromptSuite generates semantically coherent and distributionally diverse prompt variants, improving evaluation stability. The source code and online tool are publicly released.
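The component-wise perturbation idea described above can be sketched in a few lines of Python. Note this is a minimal illustrative sketch, not PromptSuite's actual API: the `PromptTemplate` class, the `variants` method, and the toy synonym table are all assumptions made for illustration.

```python
import random

# Illustrative sketch only: PromptSuite's real API is not shown in this
# summary, so the names below are hypothetical. The idea: represent a
# prompt as named semantic components (instruction, constraints, ...)
# and apply a controlled perturbation to one component while keeping
# the others fixed.

# Toy synonym table for the synonym-substitution perturbation.
SYNONYMS = {
    "summarize": ["condense", "recap"],
    "briefly": ["concisely", "in short"],
}

def synonym_substitution(text, rng):
    """Replace words found in the synonym table with a random synonym."""
    out = []
    for word in text.split():
        key = word.lower().strip(".,")
        out.append(rng.choice(SYNONYMS[key]) if key in SYNONYMS else word)
    return " ".join(out)

class PromptTemplate:
    """A prompt decomposed into named semantic components."""
    def __init__(self, **components):
        self.components = components

    def variants(self, component, perturb, n, seed=0):
        """Yield n full prompts with one component perturbed, others fixed."""
        rng = random.Random(seed)  # seeded for reproducible variants
        for _ in range(n):
            parts = dict(self.components)
            parts[component] = perturb(parts[component], rng)
            yield "\n".join(parts.values())

prompt = PromptTemplate(
    instruction="Briefly summarize the article.",
    constraints="Answer in one sentence.",
)
for v in prompt.variants("instruction", synonym_substitution, n=3):
    print(v)
```

Keeping the untouched components fixed is what makes the perturbation "controlled": any change in model performance across variants can be attributed to the one component that varied.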

📝 Abstract
Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible, working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. It is available through both a Python API (https://github.com/eliyahabba/PromptSuite) and a user-friendly web interface (https://promptsuite.streamlit.app/).
Problem

Research questions and friction points this paper is trying to address.

Unreliable LLM evaluation with single prompts
Challenges in generating robust multi-prompt variations
Lack of flexible frameworks for automated prompt generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic generation of various prompts
Modular design for controlled perturbations
Extensible support for new components
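One common way to realize the "extensible" claim above is a registry that maps perturbation names to functions, so new perturbation types plug in without touching the framework core. This is a hypothetical sketch of that pattern; the registry, decorator, and function names are illustrative, not PromptSuite's actual API.

```python
# Hypothetical extensibility sketch: a name-to-function registry plus a
# decorator for adding new perturbation types. Not PromptSuite's real API.

PERTURBATIONS = {}

def register(name):
    """Decorator that adds a perturbation function to the registry."""
    def wrap(fn):
        PERTURBATIONS[name] = fn
        return fn
    return wrap

@register("uppercase_instruction")
def uppercase(text):
    # A trivial structural rewrite used as a stand-in for a real
    # perturbation such as template injection.
    return text.upper()

def perturb(text, name):
    """Look up a registered perturbation by name and apply it."""
    return PERTURBATIONS[name](text)

print(perturb("Summarize the article.", "uppercase_instruction"))
# → SUMMARIZE THE ARTICLE.
```

With this shape, a user-defined perturbation is just another decorated function; evaluation code can then iterate over `PERTURBATIONS` to generate the full variant set.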