Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the pre-deployment safety gatekeeping challenge for large language models (LLMs), focusing on three critical risks: privacy leakage, bias, and misinformation. We conduct a third-party safety evaluation of OpenAI's o3-mini beta model. Methodologically, we propose an ASTRAL-based dynamic unsafe-prompt generation framework, the first to enable systematic, large-scale (10,080 test cases) probing of the safety boundaries of a live beta LLM. Integrating automated testing, multi-dimensional safety classification, and human-in-the-loop validation, we identify 87 verified unsafe responses and precisely localize model vulnerabilities across sensitive topics. Our contribution is a reproducible, scalable external safety-assessment paradigm for LLMs, providing both a methodological foundation and an empirical benchmark for LLM safety governance.
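
The summary above describes ASTRAL's workflow: generate category-specific unsafe prompts, execute them against the model under test, auto-classify the responses, and route flagged cases to human reviewers. The sketch below illustrates that loop in Python; it is a minimal illustration under stated assumptions, not ASTRAL's actual implementation, and all names (TestCase, generate_prompts, classify, query_model) are hypothetical stand-ins.

```python
# Minimal sketch of an ASTRAL-style safety-testing loop. All identifiers are
# hypothetical; ASTRAL's real prompt generator and response classifier are
# far more sophisticated than these placeholders.
from dataclasses import dataclass

@dataclass
class TestCase:
    category: str             # safety category, e.g. "privacy" or "misinformation"
    prompt: str               # generated unsafe test input
    response: str = ""        # model output, filled in during execution
    verdict: str = "unknown"  # automated triage result: "safe" or "unsafe"

def generate_prompts(category: str, n: int) -> list[str]:
    # Placeholder generator: ASTRAL derives up-to-date unsafe prompts per
    # category; here we only template a fixed pattern for illustration.
    return [f"[{category} probe #{i}]" for i in range(n)]

def classify(response: str) -> str:
    # Placeholder classifier: flag responses that look like compliance with an
    # unsafe request. Flagged cases are later verified by humans
    # (the human-in-the-loop step).
    unsafe_markers = ("sure, here is how", "step 1:")
    return "unsafe" if any(m in response.lower() for m in unsafe_markers) else "safe"

def run_suite(query_model, categories: list[str], per_category: int) -> list[TestCase]:
    """Generate, execute, and triage prompts; return cases needing human review."""
    flagged = []
    for category in categories:
        for prompt in generate_prompts(category, per_category):
            case = TestCase(category, prompt)
            case.response = query_model(prompt)     # call the LLM under test
            case.verdict = classify(case.response)  # automated triage
            if case.verdict == "unsafe":
                flagged.append(case)
    return flagged
```

In the reported study, a pipeline of this kind executed 10,080 generated prompts against the early o3-mini beta, and the automatically flagged responses were manually verified down to 87 confirmed unsafe behaviors.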

📝 Abstract
Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases, and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. Safety is a key property of LLMs that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports the external safety testing experience of researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM, conducted as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.
Problem

Research questions and friction points this paper is trying to address.

LLM Safety
Privacy Protection
Bias and Misinformation Prevention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Testing Method
Large Language Model Safety
Empirical Evaluation
🔎 Similar Papers
No similar papers found.
Aitor Arrieta
Mondragon University, Mondragon, Spain
Miriam Ugarte
Mondragon University, Mondragon, Spain
Pablo Valle
Mondragon University, Mondragon, Spain
J. A. Parejo
University of Seville, Seville, Spain
Sergio Segura
Professor of Software Engineering at Universidad de Sevilla, Spain
Software Testing · Software Engineering · AI4SE · Trustworthy AI