Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the pre-deployment safety gatekeeping challenge for large language models (LLMs), focusing on three critical risks: privacy leakage, bias, and misinformation. We conduct a third-party safety evaluation of OpenAI's o3-mini beta model. Methodologically, we propose an ASTRAL-based dynamic unsafe-prompt generation framework, the first to enable systematic, large-scale (10,080 test cases) probing of the safety boundaries of a live beta LLM. Integrating automated testing, multi-dimensional safety classification, and human-in-the-loop validation, we identify 87 verified unsafe responses and precisely localize model vulnerabilities across sensitive topics. Our contribution is a reproducible, scalable external safety-assessment paradigm for LLMs, providing both a methodological foundation and an empirical benchmark for LLM safety governance.
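
The summary above describes ASTRAL's workflow: generate category-specific unsafe prompts, execute them against the model under test, auto-classify the responses, and route flagged cases to human reviewers. The sketch below illustrates that loop in Python; it is a minimal illustration under stated assumptions, not ASTRAL's actual implementation, and all names (TestCase, generate_prompts, classify, query_model) are hypothetical stand-ins.

```python
# Minimal sketch of an ASTRAL-style safety-testing loop. All identifiers are
# hypothetical; ASTRAL's real prompt generator and response classifier are
# far more sophisticated than these placeholders.
from dataclasses import dataclass

@dataclass
class TestCase:
    category: str             # safety category, e.g. "privacy" or "misinformation"
    prompt: str               # generated unsafe test input
    response: str = ""        # model output, filled in during execution
    verdict: str = "unknown"  # automated triage result: "safe" or "unsafe"

def generate_prompts(category: str, n: int) -> list[str]:
    # Placeholder generator: ASTRAL derives up-to-date unsafe prompts per
    # category; here we only template a fixed pattern for illustration.
    return [f"[{category} probe #{i}]" for i in range(n)]

def classify(response: str) -> str:
    # Placeholder classifier: flag responses that look like compliance with an
    # unsafe request. Flagged cases are later verified by humans
    # (the human-in-the-loop step).
    unsafe_markers = ("sure, here is how", "step 1:")
    return "unsafe" if any(m in response.lower() for m in unsafe_markers) else "safe"

def run_suite(query_model, categories: list[str], per_category: int) -> list[TestCase]:
    """Generate, execute, and triage prompts; return cases needing human review."""
    flagged = []
    for category in categories:
        for prompt in generate_prompts(category, per_category):
            case = TestCase(category, prompt)
            case.response = query_model(prompt)     # call the LLM under test
            case.verdict = classify(case.response)  # automated triage
            if case.verdict == "unsafe":
                flagged.append(case)
    return flagged
```

In the reported study, a pipeline of this kind executed 10,080 generated prompts against the early o3-mini beta, and the automatically flagged responses were manually verified down to 87 confirmed unsafe behaviors.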

📝 Abstract
Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases, and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. Safety is a key property of LLMs that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports the external safety testing experience of researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM, conducted as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.
Problem

Research questions and friction points this paper is trying to address.

LLM Safety
Privacy Protection
Bias and Misinformation Prevention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Testing Method
Large Language Model Safety
Empirical Evaluation
🔎 Similar Papers
No similar papers found.
Aitor Arrieta
Mondragon University, Mondragon, Spain
Miriam Ugarte
Mondragon University, Mondragon, Spain
Pablo Valle
Mondragon University, Mondragon, Spain
J. A. Parejo
University of Seville, Seville, Spain
Sergio Segura
Professor of Software Engineering at Universidad de Sevilla, Spain
Software Testing · Software Engineering · AI4SE · Trustworthy AI