PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

📅 2025-04-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often fail to follow instructions reliably in production-grade data processing pipelines, compromising output quality. Method: This paper collects real-world developer requirements at scale and introduces PROMPTEVALS, a production-oriented dataset of 2,087 LLM pipeline prompts paired with 12,623 developer-defined assertion criteria, five times larger than prior collections. A held-out test split of the dataset serves as a benchmark for assertion generation, on which Mistral and Llama 3 models are fine-tuned. Results: The fine-tuned models outperform GPT-4o by 20.93% on average at generating relevant assertions, while also offering lower inference latency, making them attractive for production deployment.

📝 Abstract
Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.
Problem

Research questions and friction points this paper is trying to address.

Ensuring LLMs follow instructions in production pipelines
Creating effective assertions for LLM output reliability
Generating relevant guardrails for diverse domain applications
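The kind of assertion-based guardrail the paper studies can be sketched as a small function that runs developer-defined checks against an LLM output alongside the pipeline. This is a minimal illustrative sketch, not drawn from the PROMPTEVALS dataset; the specific criteria (valid JSON, a length budget, a disallowed phrase) are hypothetical examples of what a developer might assert.

```python
import json


def check_guardrails(output: str) -> list[str]:
    """Run simple developer-defined assertions on a raw LLM output.

    Returns the names of failed assertions (empty list = all passed).
    The criteria below are illustrative, not from PROMPTEVALS.
    """
    # Assertion 1: output must parse as JSON with a "summary" key.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if "summary" not in data:
        return ["missing_summary_key"]

    failures = []

    # Assertion 2: the summary must stay under a word budget.
    if len(data["summary"].split()) > 50:
        failures.append("summary_too_long")

    # Assertion 3: no boilerplate phrasing the developer disallows.
    if "as an ai language model" in data["summary"].lower():
        failures.append("disallowed_phrase")

    return failures


good = '{"summary": "Q3 revenue rose 12% on strong e-commerce demand."}'
bad = "not json at all"
print(check_guardrails(good))  # []
print(check_guardrails(bad))   # ['invalid_json']
```

In practice each pipeline prompt gets its own set of such criteria; the paper's benchmark measures how well models can generate that set automatically from the prompt alone.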
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset of 2,087 prompts with 12,623 corresponding assertion criteria, 5x larger than prior collections
Fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average
Open-source tools for LLM pipeline reliability