🤖 AI Summary
This work presents the first systematic evaluation of fairness–utility trade-offs in instruction-tuned small language models (SLMs) with 0.5–5B parameters under resource-constrained settings. Using the BBQ benchmark, we conduct a large-scale audit of nine open-source models—including Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi variants—employing zero-shot prompting with ambiguity-aware and disambiguation-context modeling. We further analyze the impact of 4-bit AWQ quantization on fairness. Key findings include: (1) Phi models achieve >90% F1 while exhibiting the lowest bias; (2) Qwen 2.5’s apparent fairness stems from random guessing rather than robust reasoning; (3) LLaMA 3.2 manifests overconfident stereotypical bias; and (4) quantization effects on fairness are model-specific, refuting the generalized assumption that compression inevitably degrades fairness. These results provide empirical grounding and architectural guidance for the ethical deployment of SLMs in constrained environments.
📝 Abstract
The rapid adoption of Small Language Models (SLMs) for on-device and resource-constrained deployments has outpaced our understanding of their ethical risks. To the best of our knowledge, we present the first large-scale audit of instruction-tuned SLMs spanning 0.5 to 5 billion parameters-an overlooked"middle tier"between BERT-class encoders and flagship LLMs. Our evaluation includes nine open-source models from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families. Using the BBQ benchmark under zero-shot prompting, we analyze both utility and fairness across ambiguous and disambiguated contexts. This evaluation reveals three key insights. First, competence and fairness need not be antagonistic: Phi models achieve F1 scores exceeding 90 percent while exhibiting minimal bias, showing that efficient and ethical NLP is attainable. Second, social bias varies significantly by architecture: Qwen 2.5 models may appear fair, but this often reflects vacuous neutrality, random guessing, or evasive behavior rather than genuine ethical alignment. In contrast, LLaMA 3.2 models exhibit stronger stereotypical bias, suggesting overconfidence rather than neutrality. Third, compression introduces nuanced trade-offs: 4-bit AWQ quantization improves F1 scores in ambiguous settings for LLaMA 3.2-3B but increases disability-related bias in Phi-4-Mini by over 7 percentage points. These insights provide practical guidance for the responsible deployment of SLMs in applications demanding fairness and efficiency, particularly benefiting small enterprises and resource-constrained environments.