Towards Safer Chatbots: A Framework for Policy Compliance Evaluation of Custom GPTs

📅 2025-02-03

📈 Citations: 0

✨ Influential: 0

career value

190K/year

🤖 AI Summary

Custom GPTs—deployed in platforms like the OpenAI GPT Store—pose significant security and compliance risks due to their opaque “black-box” nature, yet systematic evaluation frameworks remain absent. Method: We propose the first end-to-end automated compliance evaluation framework comprising three components: (1) automatic discovery of Custom GPTs, (2) policy-guided red-teaming prompt generation based on regulatory classification, and (3) LLM-as-a-judge–driven violation detection, validated by human annotation (F1 = 0.975). Contribution/Results: Evaluated on 782 real-world Custom GPTs, we find that 58.7% violate OpenAI’s Usage Policies; violations stem predominantly from underlying base models—not user customization—and correlate neither with popularity nor deployment frequency. These findings expose critical gaps in the GPT Store’s current moderation pipeline and provide empirical evidence and a methodological foundation for cross-platform large language model governance.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) have gained unprecedented prominence, achieving widespread adoption across diverse domains and integrating deeply into society. The capability to fine-tune general-purpose LLMs, such as Generative Pre-trained Transformers (GPT), for specific tasks has facilitated the emergence of numerous Custom GPTs. These tailored models are increasingly made available through dedicated marketplaces, such as OpenAI's GPT Store. However, their black-box nature introduces significant safety and compliance risks. In this work, we present a scalable framework for the automated evaluation of Custom GPTs against OpenAI's usage policies, which define the permissible behaviors of these systems. Our framework integrates three core components: (1) automated discovery and data collection of models from the GPT store, (2) a red-teaming prompt generator tailored to specific policy categories and the characteristics of each target GPT, and (3) an LLM-as-a-judge technique to analyze each prompt-response pair for potential policy violations. We validate our framework with a manually annotated ground truth, and evaluate it through a large-scale study with 782 Custom GPTs across three categories: Romantic, Cybersecurity, and Academic GPTs. Our manual annotation process achieved an F1 score of 0.975 in identifying policy violations, confirming the reliability of the framework's assessments. The results reveal that 58.7% of the analyzed models exhibit indications of non-compliance, exposing weaknesses in the GPT store's review and approval processes. Furthermore, our findings indicate that a model's popularity does not correlate with compliance, and non-compliance issues largely stem from behaviors inherited from base models rather than user-driven customizations. We believe this approach is extendable to other chatbot platforms and policy domains, improving LLM-based systems safety.

Problem

Research questions and friction points this paper is trying to address.

Customized Large Language Models

Safety and Compliance

Black Box Nature

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Compliance Framework

Customized GPT Models

Security Enhancement

🔎 Similar Papers

How Privacy-Savvy Are Large Language Models? A Case Study on Compliance and Privacy Technical Review