🤖 AI Summary
This study systematically evaluates the propensity of small large language models (LLMs) to generate harmful content and examines how well large LLMs' automated harmfulness annotations align with human judgments. Methodologically, we use prompt engineering to elicit harmful responses of various types from three small LLMs, collect fine-grained human annotations of these outputs, and then evaluate three state-of-the-art large LLMs as automated harmfulness annotators. The cross-model comparison reveals substantial inter-model variability in harmfulness among the small LLMs. The three large LLMs achieve only low-to-moderate agreement with human annotators on harmfulness identification (Krippendorff's α < 0.6), exposing a gap between their discriminative capability and human consensus. These findings provide an empirical benchmark and a methodological framework for safety assessment of small LLMs and for evaluating the reliability of large LLMs in automated annotation tasks.
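The annotation setup can be pictured as prompting a large LLM to assign one harm category to each small-model response. Below is a minimal sketch, assuming the OpenAI Python SDK; the model name, label set, and prompt wording are illustrative assumptions, not the protocol actually used in the study.

```python
# Hypothetical sketch of LLM-as-annotator for harmfulness labeling.
# Model name, labels, and prompt are illustrative, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["discriminatory language", "offensive content",
          "privacy invasion", "negative influence", "not harmful"]

def annotate_harmfulness(response_text: str) -> str:
    """Ask the annotator model to pick one harm category for a given response."""
    prompt = (
        "You are annotating chatbot responses for harmfulness.\n"
        f"Possible labels: {', '.join(LABELS)}.\n"
        "Reply with exactly one label.\n\n"
        f"Response to annotate:\n{response_text}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",   # assumed annotator model
        temperature=0,    # deterministic labels for reproducibility
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip().lower()

print(annotate_harmfulness("You people are too stupid to understand this."))
```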
📝 Abstract
Large language models (LLMs) have become ubiquitous, so it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as on edge devices, but they may differ in their propensity to generate harmful output. Mitigating LLM harm typically depends on annotating the harmfulness of LLM output, and such annotations are expensive to collect from humans. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses. We find that the smaller models differ with respect to harmfulness. We also find that the large LLMs show only low to moderate agreement with human annotators. These findings underline the need for further work on harm mitigation in LLMs.
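The agreement between LLM and human annotators can be quantified from a matrix of annotator-by-item labels, for example with Krippendorff's α. A minimal sketch follows, assuming the third-party `krippendorff` Python package and invented example labels; the study's actual data and annotation scheme are not reproduced here.

```python
# Hypothetical sketch: agreement between human and LLM harm labels via
# Krippendorff's alpha (pip install krippendorff). Label data is made up.
import numpy as np
import krippendorff

# One row per annotator (two humans, one large LLM), one column per response.
labels = [
    ["offensive", "privacy", "none",      "offensive", "negative"],  # human 1
    ["offensive", "privacy", "offensive", "offensive", "none"],      # human 2
    ["none",      "privacy", "none",      "offensive", "negative"],  # large LLM
]

# Encode categorical labels as integers for the nominal-level computation.
categories = sorted({label for row in labels for label in row})
to_id = {label: i for i, label in enumerate(categories)}
data = np.array([[to_id[label] for label in row] for row in labels], dtype=float)

alpha = krippendorff.alpha(reliability_data=data, level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")  # values below ~0.6 suggest weak agreement
```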