Fairness is Not Silence: Unmasking Vacuous Neutrality in Small Language Models

📅 2025-06-10

📈 Citations: 0

✨ Influential: 0

career value

192K/year

🤖 AI Summary

This work presents the first systematic evaluation of fairness–utility trade-offs in instruction-tuned small language models (SLMs) with 0.5–5B parameters under resource-constrained settings. Using the BBQ benchmark, we conduct a large-scale audit of nine open-source models—including Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi variants—employing zero-shot prompting with ambiguity-aware and disambiguation-context modeling. We further analyze the impact of 4-bit AWQ quantization on fairness. Key findings include: (1) Phi models achieve >90% F1 while exhibiting the lowest bias; (2) Qwen 2.5’s apparent fairness stems from random guessing rather than robust reasoning; (3) LLaMA 3.2 manifests overconfident stereotypical bias; and (4) quantization effects on fairness are model-specific, refuting the generalized assumption that compression inevitably degrades fairness. These results provide empirical grounding and architectural guidance for the ethical deployment of SLMs in constrained environments.

Technology Category

Application Category

📝 Abstract

The rapid adoption of Small Language Models (SLMs) for on-device and resource-constrained deployments has outpaced our understanding of their ethical risks. To the best of our knowledge, we present the first large-scale audit of instruction-tuned SLMs spanning 0.5 to 5 billion parameters-an overlooked"middle tier"between BERT-class encoders and flagship LLMs. Our evaluation includes nine open-source models from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families. Using the BBQ benchmark under zero-shot prompting, we analyze both utility and fairness across ambiguous and disambiguated contexts. This evaluation reveals three key insights. First, competence and fairness need not be antagonistic: Phi models achieve F1 scores exceeding 90 percent while exhibiting minimal bias, showing that efficient and ethical NLP is attainable. Second, social bias varies significantly by architecture: Qwen 2.5 models may appear fair, but this often reflects vacuous neutrality, random guessing, or evasive behavior rather than genuine ethical alignment. In contrast, LLaMA 3.2 models exhibit stronger stereotypical bias, suggesting overconfidence rather than neutrality. Third, compression introduces nuanced trade-offs: 4-bit AWQ quantization improves F1 scores in ambiguous settings for LLaMA 3.2-3B but increases disability-related bias in Phi-4-Mini by over 7 percentage points. These insights provide practical guidance for the responsible deployment of SLMs in applications demanding fairness and efficiency, particularly benefiting small enterprises and resource-constrained environments.

Problem

Research questions and friction points this paper is trying to address.

Assessing ethical risks in Small Language Models (SLMs)

Evaluating fairness and utility in ambiguous contexts

Analyzing bias variations across different SLM architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale audit of instruction-tuned SLMs

Analyze utility and fairness using BBQ benchmark

Evaluate bias variations across different architectures

🔎 Similar Papers

From Prejudice to Parity: A New Approach to Debiasing Large Language Model Word Embeddings