🤖 AI Summary
Current safety evaluations of large reasoning models predominantly focus on final outputs, overlooking the dynamic evolution of harmful behaviors throughout the reasoning chain. This work introduces HarmThoughts, a novel benchmark that establishes the first fine-grained taxonomy encompassing 16 categories of harmful reasoning behaviors, enabling sentence-level annotation and analysis of harm propagation across multi-step reasoning trajectories. Leveraging this benchmark, we systematically evaluate both white-box and black-box detectors and find that existing methods struggle to identify subtle harmful behaviors in the early and intermediate stages of reasoning chains. Our findings reveal a critical gap in process-level safety monitoring and provide the research community with the first safety evaluation standard specifically designed for reasoning processes.
📝 Abstract
Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When an LRM is jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces, which is a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed taxonomy of 16 harmful reasoning behaviors across four functional groups, which characterizes how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each sentence annotated with a fine-grained behavioral label. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white-box and black-box detectors on the task of identifying harmful reasoning behaviors on HarmThoughts. Our results show that existing detectors struggle with fine-grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring. HarmThoughts is publicly available at: https://huggingface.co/datasets/ishitakakkar-10/HarmThoughts
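To make the notion of a drift point concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each reasoning trace is available as an ordered list of per-sentence labels and that benign sentences carry a single label, here called `"safe"`. The label names, the `find_drift_points` helper, and the input layout are all hypothetical; the actual dataset schema and taxonomy names may differ.

```python
from typing import List, Sequence

# Hypothetical name for the benign label; the real dataset may use another.
SAFE_LABEL = "safe"


def find_drift_points(labels: Sequence[str], safe_label: str = SAFE_LABEL) -> List[int]:
    """Return the indices of sentences where a trace moves from safe to unsafe.

    A drift point is taken here to be any sentence whose label is not
    `safe_label` and whose immediately preceding sentence is `safe_label`.
    This is an illustrative definition, assumed for the sketch.
    """
    drift_points = []
    for i in range(1, len(labels)):
        if labels[i - 1] == safe_label and labels[i] != safe_label:
            drift_points.append(i)
    return drift_points


# Illustrative trace with made-up behavior labels.
example_trace = ["safe", "safe", "suppress_refusal", "safe", "decompose_task"]
print(find_drift_points(example_trace))
```

Under this definition a trace that begins with an unsafe sentence has no drift point at index 0, since there is no preceding safe sentence to transition from; a different convention could treat the start of the trace as implicitly safe.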