🤖 AI Summary
This study systematically evaluates large language models' (LLMs) sensitivity to harmful content in long-context settings (600-6000 tokens), focusing on safety-critical applications. Using controlled experiments across mainstream models (LLaMA-3, Qwen-2.5, and Mistral), we independently vary four key factors: harm type (explicit vs. implicit), position (beginning, middle, or end), proportion (0.01-0.50 of the prompt), and context length. Our analysis reveals three principal findings: (i) recall peaks at a moderate harm proportion (0.25) and declines when harmful content is very sparse or dominant, (ii) explicit harms and initial positioning significantly improve detection, whereas longer contexts consistently degrade recall, and (iii) implicit harms and mid-context placement are the dominant sources of missed detections. To our knowledge, this work provides the first empirical benchmark for long-context safety evaluation and delivers actionable, factor-level attributions to guide the design of robust safety mechanisms for extended textual inputs.
📝 Abstract
Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. While their long-context capabilities are well studied for reasoning and retrieval, little is known about their behavior in safety-critical scenarios. We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). Across harmful content categories such as toxic language, offensive content, and hate speech, and with LLaMA-3, Qwen-2.5, and Mistral, we observe consistent patterns: detection performance peaks at moderate harmful prevalence (0.25) but declines when such content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning of the prompt are detected more reliably; and explicit content is recognized more consistently than implicit content. These findings provide the first systematic view of how LLMs detect and weigh harmful content in long contexts, highlighting both their emerging strengths and the challenges that remain for safety-critical use.
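To make the factor grid concrete, the kind of controlled prompt construction the abstract describes could be sketched as follows. This is a minimal illustration with hypothetical helper and parameter names (`build_prompt`, `context_len` as a sentence count standing in for the 600-6000 token lengths); the paper's actual construction may differ.

```python
def build_prompt(benign, harmful, position="beginning",
                 proportion=0.25, context_len=20):
    """Mix benign filler sentences with harmful sentences at a given
    position and proportion, producing one long-context test prompt."""
    # Number of harmful sentences implied by the target proportion.
    n_harm = max(1, round(proportion * context_len))
    n_benign = context_len - n_harm
    # Cycle through the pools so any pool size works.
    filler = [benign[i % len(benign)] for i in range(n_benign)]
    harm = [harmful[i % len(harmful)] for i in range(n_harm)]
    if position == "beginning":
        sentences = harm + filler
    elif position == "end":
        sentences = filler + harm
    else:  # "middle": insert the harmful span at the midpoint
        mid = n_benign // 2
        sentences = filler[:mid] + harm + filler[mid:]
    return " ".join(sentences)

# One cell of the factor grid: 25% harmful content at the beginning
# of a 20-sentence context.
prompt = build_prompt(["The weather was mild."], ["<harmful sentence>"],
                      position="beginning", proportion=0.25, context_len=20)
```

Sweeping `position`, `proportion`, and `context_len` over the values reported in the study, while holding the other factors fixed, yields the independent factor-level comparisons the abstract refers to.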