🤖 AI Summary
This work systematically assesses worst-case frontier risks of releasing open-weight large language models (LLMs) in biology and cybersecurity—specifically their potential for malicious use such as biological threat creation and cyber exploitation.
Method: We introduce Malicious Fine-Tuning (MFT), a framework for eliciting maximum domain-specific hazardous capability from open-weight models such as gpt-oss. For biology, MFT fine-tunes the model with reinforcement learning in an environment with web browsing on threat-creation tasks; for cybersecurity, it trains the model in an agentic coding environment on Capture-the-Flag (CTF) challenges.
Contribution/Results: MFT enables reproducible, scalable stress-testing of open-weight LLMs under adversarial fine-tuning. Empirically, MFT-trained gpt-oss underperforms frontier closed-weight models—remaining below OpenAI o3, itself under the Preparedness High capability threshold—and at most marginally advances biological capabilities relative to existing open-weight models. To our knowledge, this is the first study to establish a rigorous, extensible paradigm for frontier-risk assessment of open-weight LLM releases, providing both a methodology and empirical baselines for AI safety evaluation.
📝 Abstract
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below the Preparedness High capability threshold for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results contributed to our decision to release the model, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.