🤖 AI Summary
This study systematically evaluates the accuracy–efficiency trade-offs of dense and mixture-of-experts (MoE) language models under realistic inference constraints. Through controlled experiments on seven recent reasoning-oriented instruction-tuned models across four benchmarks and three prompting strategies (zero-shot, chain-of-thought, and few-shot chain-of-thought), the authors measure end-to-end performance in terms of accuracy, latency, peak GPU memory consumption, and an approximate FLOPs-per-token proxy. The findings show that sparse activation does not universally yield the best real-world operating point; rather, the accuracy–efficiency balance depends jointly on model architecture, prompting method, and task composition. The work contributes a reproducible, deployment-oriented evaluation framework and identifies Gemma-4-E4B with few-shot chain-of-thought as the best overall configuration in the weighted multi-task summary, reaching an accuracy of 0.675 at 14.9 GB of GPU memory.
📝 Abstract
Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks -- ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 -- under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations per token (FLOPs/token) proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.
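The weighted multi-task accuracy reported above can be sketched as a simple weighted mean over per-task accuracies. This is a minimal illustration only: the task weights, example counts, and per-task scores below are assumptions for demonstration, not values from the paper, and the actual weighting scheme used by the authors may differ.

```python
def weighted_accuracy(per_task_acc: dict, weights: dict) -> float:
    """Aggregate per-task accuracies into a single weighted score.

    Each task's accuracy is weighted by its share of the total weight
    (e.g. its number of evaluation examples).
    """
    total = sum(weights.values())
    return sum(per_task_acc[task] * w for task, w in weights.items()) / total


# Illustrative (assumed) per-task accuracies and example counts for the
# four benchmarks named in the abstract; not the paper's actual numbers.
acc = {"arc_challenge": 0.82, "gsm8k": 0.64, "math_l1_3": 0.55, "truthfulqa_mc1": 0.61}
counts = {"arc_challenge": 1172, "gsm8k": 1319, "math_l1_3": 900, "truthfulqa_mc1": 817}

print(round(weighted_accuracy(acc, counts), 3))
```

Weighting by example count makes the summary reflect the benchmark mix rather than treating a small task and a large task as equally important; equal weights per task are the other common choice, and the two can rank models differently.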