🤖 AI Summary
Can small language models (SLMs) match the practical performance of large language models (LLMs)?
Method: We introduce the first multi-dimensional evaluation framework tailored to real-world deployment scenarios, assessing 10 prominent open-source SLMs (e.g., Phi-3, Qwen2, Llama3) across three orthogonal dimensions—task type, application domain, and reasoning paradigm—while incorporating diverse prompting strategies. Unlike conventional benchmark-score-driven evaluations, our approach centers on semantic correctness as the primary metric and adopts a semantic-consistency-driven, controlled experimental paradigm to ensure fair cross-model and cross-prompt comparisons.
Contribution/Results: Under optimized configurations, several SLMs significantly outperform DeepSeek-v2, GPT-4o-mini, and Gemini-1.5-Pro, approaching GPT-4o-level performance. Moreover, we identify optimal model–prompt pairings for each scenario, delivering reproducible, actionable guidance for SLM selection and deployment in practical applications.
📝 Abstract
The rapid rise of Language Models (LMs) has expanded their use across many applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging, as smaller LMs do not perform well universally. This work bridges that gap by proposing a framework to experimentally evaluate small, open LMs in practical settings, measuring the semantic correctness of outputs across three practical aspects: task types, application domains, and reasoning types, using diverse prompt styles. Using the proposed framework, it also conducts an in-depth comparison of 10 small, open LMs to identify the best LM and prompt style for each specific application requirement. We also show that, when selected appropriately, small LMs can outperform SOTA LLMs such as DeepSeek-v2, GPT-4o-mini, and Gemini-1.5-Pro, and even compete with GPT-4o.