🤖 AI Summary
This study systematically evaluates the propensity of small large language models (LLMs) to generate harmful content and examines how well large LLMs' automated harmfulness annotations align with human judgments. Methodologically, we use prompt engineering to elicit harmful responses of various types from three small LLMs, collect fine-grained human annotations of these outputs, and then evaluate three state-of-the-art large LLMs as automated harmfulness annotators. The cross-model comparison reveals substantial inter-model variability in harmfulness among the small LLMs. The three large LLMs achieve only low-to-moderate agreement with human annotators on harmfulness identification (Krippendorff's α < 0.6), exposing a gap between their discriminative capability and human consensus. These findings provide an empirical benchmark and a methodological framework for safety assessment of small LLMs and for evaluating the reliability of large LLMs in automated annotation tasks.
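The annotation setup can be pictured as prompting a large LLM to assign one harm category to each small-model response. Below is a minimal sketch, assuming the OpenAI Python SDK; the model name, label set, and prompt wording are illustrative assumptions, not the protocol actually used in the study.

```python
# Hypothetical sketch of LLM-as-annotator for harmfulness labeling.
# Model name, labels, and prompt are illustrative, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["discriminatory language", "offensive content",
          "privacy invasion", "negative influence", "not harmful"]

def annotate_harmfulness(response_text: str) -> str:
    """Ask the annotator model to pick one harm category for a given response."""
    prompt = (
        "You are annotating chatbot responses for harmfulness.\n"
        f"Possible labels: {', '.join(LABELS)}.\n"
        "Reply with exactly one label.\n\n"
        f"Response to annotate:\n{response_text}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",   # assumed annotator model
        temperature=0,    # deterministic labels for reproducibility
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip().lower()

print(annotate_harmfulness("You people are too stupid to understand this."))
```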
📝 Abstract
Large language models (LLMs) have become ubiquitous, so it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as on edge devices, but they may differ in their propensity to generate harmful output. Mitigating LLM harm typically depends on annotating the harmfulness of LLM output, and such annotations are expensive to collect from humans. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses. We find that the smaller models differ with respect to harmfulness. We also find that the large LLMs show only low to moderate agreement with human annotators. These findings underline the need for further work on harm mitigation in LLMs.
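The agreement between LLM and human annotators can be quantified from a matrix of annotator-by-item labels, for example with Krippendorff's α. A minimal sketch follows, assuming the third-party `krippendorff` Python package and invented example labels; the study's actual data and annotation scheme are not reproduced here.

```python
# Hypothetical sketch: agreement between human and LLM harm labels via
# Krippendorff's alpha (pip install krippendorff). Label data is made up.
import numpy as np
import krippendorff

# One row per annotator (two humans, one large LLM), one column per response.
labels = [
    ["offensive", "privacy", "none",      "offensive", "negative"],  # human 1
    ["offensive", "privacy", "offensive", "offensive", "none"],      # human 2
    ["none",      "privacy", "none",      "offensive", "negative"],  # large LLM
]

# Encode categorical labels as integers for the nominal-level computation.
categories = sorted({label for row in labels for label in row})
to_id = {label: i for i, label in enumerate(categories)}
data = np.array([[to_id[label] for label in row] for row in labels], dtype=float)

alpha = krippendorff.alpha(reliability_data=data, level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")  # values below ~0.6 suggest weak agreement
```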