🤖 AI Summary
This study challenges the prevailing assumption that larger model size inherently yields superior performance, particularly in financial natural language processing (NLP). Method: We systematically evaluate the GPT-OSS series of large language models on ten financial NLP tasks, including sentiment analysis, question answering, and named entity recognition, using real-world datasets such as Financial PhraseBank and FiQA-SA. We introduce novel efficiency metrics, notably the Token Efficiency Score, and establish a multi-dimensional benchmarking framework that jointly assesses accuracy and computational efficiency. Contribution/Results: Our results demonstrate that the lightweight GPT-OSS-20B achieves competitive accuracy (65.1%), only marginally below GPT-OSS-120B (66.5%), while attaining a Token Efficiency Score of 198.4 and a throughput of 159.8 tokens/sec, substantially outperforming larger models such as Qwen3-235B. Crucially, domain-adapted smaller models deliver Pareto improvements in the accuracy-efficiency trade-off, establishing a new paradigm for lightweight deployment of financial LLMs.
📝 Abstract
The rapid adoption of large language models in financial services necessitates rigorous evaluation frameworks to assess their performance, efficiency, and practical applicability. This paper conducts a comprehensive evaluation of the GPT-OSS model family alongside contemporary LLMs across ten diverse financial NLP tasks. Through extensive experimentation on the 120B- and 20B-parameter variants of GPT-OSS, we reveal a counterintuitive finding: the smaller GPT-OSS-20B model achieves comparable accuracy (65.1% vs. 66.5%) while demonstrating superior computational efficiency, with a Token Efficiency Score of 198.4 and a processing speed of 159.8 tokens per second [1]. Our evaluation encompasses sentiment analysis, question answering, and named entity recognition tasks using real-world financial datasets including Financial PhraseBank, FiQA-SA, and FLARE FINERORD. We introduce novel efficiency metrics that capture the trade-off between model performance and resource utilization, providing critical insights for deployment decisions in production environments. The benchmark reveals that GPT-OSS models consistently outperform larger competitors, including Qwen3-235B, challenging the prevailing assumption that model scale directly correlates with task performance [2]. Our findings demonstrate that architectural innovations and training strategies in GPT-OSS enable smaller models to achieve competitive performance with significantly reduced computational overhead, offering a pathway toward sustainable and cost-effective deployment of LLMs in financial applications.
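The abstract does not give the definition of the Token Efficiency Score, so the sketch below is purely illustrative: it shows one plausible way a joint accuracy/throughput metric could be computed and compared across models. The class name, function, formula, and the 120B throughput figure are all assumptions for illustration, not the paper's actual definition (which yields the reported 198.4 for GPT-OSS-20B).

```python
# Hypothetical sketch of an accuracy-vs-throughput efficiency metric.
# The formula here is an assumption, NOT the paper's Token Efficiency Score.
from dataclasses import dataclass


@dataclass
class ModelRun:
    name: str
    accuracy: float        # task accuracy in [0, 1]
    tokens_per_sec: float  # measured generation throughput


def efficiency_score(run: ModelRun, baseline_tps: float = 100.0) -> float:
    """Accuracy (in %) weighted by throughput relative to a baseline.

    Higher is better; a model that is slightly less accurate but much
    faster can still score higher than a larger, slower model.
    """
    return 100.0 * run.accuracy * (run.tokens_per_sec / baseline_tps)


runs = [
    ModelRun("GPT-OSS-20B", 0.651, 159.8),   # figures from the abstract
    ModelRun("GPT-OSS-120B", 0.665, 80.0),   # throughput here is illustrative
]
for r in runs:
    print(f"{r.name}: {efficiency_score(r):.1f}")
```

Under this toy formulation, the 20B model's throughput advantage more than offsets its 1.4-point accuracy gap, mirroring the trade-off the paper reports.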