🤖 AI Summary
This study systematically evaluates large language models' (LLMs) sensitivity to harmful content in long-context settings (600-6000 tokens), focusing on safety-critical applications. Using controlled experiments across mainstream models (LLaMA-3, Qwen-2.5, and Mistral), we independently vary four key factors: harm type (explicit vs. implicit), position (beginning, middle, or end), proportion (0.01-0.50 of the prompt), and context length. Our analysis reveals three principal findings: (i) recall peaks at a moderate harm proportion (0.25) and declines when harmful content is very sparse or dominant, (ii) explicit harms and initial positioning significantly improve detection, whereas longer contexts consistently degrade recall, and (iii) implicit harms and mid-context placement are the dominant sources of missed detections. To our knowledge, this work provides the first empirical benchmark for long-context safety evaluation and delivers actionable, factor-level attributions to guide the design of robust safety mechanisms for extended textual inputs.
📝 Abstract
Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. While their long-context capabilities are well studied for reasoning and retrieval, little is known about their behavior in safety-critical scenarios. We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). Across harmful content categories such as toxic language, offensive content, and hate speech, and with LLaMA-3, Qwen-2.5, and Mistral, we observe consistent patterns: detection performance peaks at moderate harmful prevalence (0.25) but declines when such content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning of the prompt are detected more reliably; and explicit content is recognized more consistently than implicit content. These findings provide the first systematic view of how LLMs detect and weigh harmful content in long contexts, highlighting both their emerging strengths and the challenges that remain for safety-critical use.
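To make the factor grid concrete, the kind of controlled prompt construction the abstract describes could be sketched as follows. This is a minimal illustration with hypothetical helper and parameter names (`build_prompt`, `context_len` as a sentence count standing in for the 600-6000 token lengths); the paper's actual construction may differ.

```python
def build_prompt(benign, harmful, position="beginning",
                 proportion=0.25, context_len=20):
    """Mix benign filler sentences with harmful sentences at a given
    position and proportion, producing one long-context test prompt."""
    # Number of harmful sentences implied by the target proportion.
    n_harm = max(1, round(proportion * context_len))
    n_benign = context_len - n_harm
    # Cycle through the pools so any pool size works.
    filler = [benign[i % len(benign)] for i in range(n_benign)]
    harm = [harmful[i % len(harmful)] for i in range(n_harm)]
    if position == "beginning":
        sentences = harm + filler
    elif position == "end":
        sentences = filler + harm
    else:  # "middle": insert the harmful span at the midpoint
        mid = n_benign // 2
        sentences = filler[:mid] + harm + filler[mid:]
    return " ".join(sentences)

# One cell of the factor grid: 25% harmful content at the beginning
# of a 20-sentence context.
prompt = build_prompt(["The weather was mild."], ["<harmful sentence>"],
                      position="beginning", proportion=0.25, context_len=20)
```

Sweeping `position`, `proportion`, and `context_len` over the values reported in the study, while holding the other factors fixed, yields the independent factor-level comparisons the abstract refers to.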