Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether sparse Mixture-of-Experts (MoE) models adhere to conventional scaling laws, focusing on OpenAI’s 2025 open-source GPT-OSS models (20B/120B parameters). Method: We conduct a systematic evaluation across ten benchmarks—including MMLU (general knowledge), HumanEval (code generation), and mathematical reasoning—using unquantized models under standardized inference conditions. Statistical rigor is ensured via McNemar’s test and effect-size analysis. Contribution/Results: Contrary to expectations, GPT-OSS-20B outperforms GPT-OSS-120B on multiple tasks, especially in code generation, though it lags in multilingual understanding. Overall, GPT-OSS achieves mid-tier performance among mainstream open-source LMs while exhibiting lower memory footprint and energy consumption. These findings challenge the prevailing assumption that “more parameters imply stronger performance” in sparse architectures, providing empirical evidence and a new paradigm for designing efficient, low-carbon large language models.

📝 Abstract
In August 2025, OpenAI released the GPT-OSS models, its first open-weight large language models since GPT-2 in 2019, comprising two mixture-of-experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open-source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect-size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open-source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open-source deployments.
Problem

Research questions and friction points this paper is trying to address.

Evaluates the performance of GPT-OSS models against contemporary open-source LLMs.
Assesses whether scaling sparse architectures yields proportional performance gains.
Identifies relative strengths in code generation and weaknesses in multilingual tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-experts architectures with 120B and 20B parameters
Standardized inference settings with statistical validation
Evaluation across ten diverse benchmarks
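The statistical validation mentioned above can be sketched in code. McNemar's test compares two models on the same benchmark items by looking only at the discordant pairs (items exactly one model answers correctly). This is a minimal, self-contained sketch; the per-item correctness lists below are hypothetical, not data from the paper:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on the discordant pairs:
    b = items model A got right and model B got wrong,
    c = items model B got right and model A got wrong.
    Under the null, discordant outcomes are Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements: no evidence either way
    k = min(b, c)
    # One-sided binomial tail P(X <= k) with p = 0.5, then doubled
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-item correctness (1 = correct) for two models
a_correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b_correct = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]
b = sum(1 for x, y in zip(a_correct, b_correct) if x == 1 and y == 0)
c = sum(1 for x, y in zip(a_correct, b_correct) if x == 0 and y == 1)
print(f"discordant pairs: b={b}, c={c}, p={mcnemar_exact(b, c)}")
```

Because both models see the same items, this paired test is more appropriate than comparing aggregate accuracy scores with an unpaired test; `statsmodels` provides an equivalent implementation for larger studies.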
Ziqian Bi
Purdue University, United States
Keyu Chen
Georgia Institute of Technology, United States
Chiung-Yi Tseng
LuxMuse AI
Danyang Zhang
ByteDance Inc, United States
Tianyang Wang
University of Alabama at Birmingham
machine learning (deep learning), computer vision
Hongying Luo
AI Agent Lab, Vokram Group, United Kingdom
Lu Chen
AI Agent Lab, Vokram Group, United Kingdom
Junming Huang
AI Agent Lab, Vokram Group, United Kingdom
Jibin Guan
University of Minnesota, United States
Junfeng Hao
Hemodialysis Center, Affiliated Hospital of Guangdong Medical University (Chief Physician)
nephrology, hemodialysis, hemodialysis vascular access
Junhao Song
Imperial College London, United Kingdom