🤖 AI Summary
Can small language models (SLMs) match the practical performance of large language models (LLMs)?
Method: We introduce the first multi-dimensional evaluation framework tailored to real-world deployment scenarios, assessing 10 prominent open-source SLMs (e.g., Phi-3, Qwen2, Llama3) across three orthogonal dimensions—task type, application domain, and reasoning paradigm—while incorporating diverse prompting strategies. Unlike conventional benchmark-score-driven evaluations, our approach centers on semantic correctness as the primary metric and adopts a semantic-consistency-driven, controlled experimental paradigm to ensure fair cross-model and cross-prompt comparisons.
Contribution/Results: Under optimized configurations, several SLMs significantly outperform DeepSeek-v2, GPT-4o-mini, and Gemini-1.5-Pro, approaching GPT-4o-level performance. Moreover, we identify optimal model–prompt pairings for each scenario, delivering reproducible, actionable guidance for SLM selection and deployment in practical applications.
📝 Abstract
The rapid rise of Language Models (LMs) has expanded their use across many applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging, as smaller LMs do not perform well universally. This work bridges that gap by proposing a framework to experimentally evaluate small, open LMs in practical settings, measuring the semantic correctness of outputs across three practical aspects: task types, application domains, and reasoning types, using diverse prompt styles. Using the proposed framework, it also conducts an in-depth comparison of 10 small, open LMs to identify the best LM and prompt style for each specific application requirement. We also show that, when selected appropriately, small LMs can outperform SOTA LLMs such as DeepSeek-v2, GPT-4o-mini, and Gemini-1.5-Pro, and even compete with GPT-4o.