Knowledge Graph Guided Evaluation of Abstention Techniques

πŸ“… 2024-12-10
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of evaluating abstention mechanisms in language models. We introduce SELECT, the first knowledge-graph-based benchmark for abstention evaluation, which isolates benign concepts (e.g., "rivers") to eliminate confounding effects from adversarial prompts and systematically assesses the conceptual-level generalization and specificity of abstention methods. Leveraging knowledge graphs to model concept taxonomies, we design ancestor/descendant sampling strategies to probe abstention behavior across hierarchical concept relations. We conduct a cross-model evaluation on six open-weight and closed-source LMs. Results show that while mainstream abstention techniques achieve >80% abstention rates overall, their abstention rate drops by 19% on descendants of the targeted concepts. Crucially, no single method dominates across dimensions: a fundamental trade-off exists between generalization and specificity. This is the first work to quantitatively characterize hierarchical failure modes of abstention mechanisms, establishing a structured, graph-aware evaluation paradigm for trustworthy AI.

πŸ“ Abstract
To deploy language models safely, it is crucial that they abstain from responding to inappropriate requests. Several prior studies test the safety promises of models based on their effectiveness in blocking malicious requests. In this work, we focus on evaluating the underlying techniques that cause models to abstain. We create SELECT, a benchmark derived from a set of benign concepts (e.g., "rivers") from a knowledge graph. Focusing on benign concepts isolates the effect of safety training, and grounding these concepts in a knowledge graph allows us to study the generalization and specificity of abstention techniques. Using SELECT, we benchmark different abstention techniques over six open-weight and closed-source models. We find that the examined techniques indeed cause models to abstain, with over 80% abstention rates. However, these techniques are not as effective for descendants of the target concepts, where abstention rates drop by 19%. We also characterize the generalization-specificity trade-offs for different techniques. Overall, no single technique is invariably better than others, and our findings inform practitioners of the various trade-offs involved.
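The evaluation idea in the abstract — measure abstention on a target concept and separately on its knowledge-graph descendants to expose a generalization gap — can be sketched as follows. This is a minimal, hypothetical illustration: the toy taxonomy, the `model_abstains` oracle, and every function name here are assumptions for exposition, not SELECT's actual construction.

```python
# Hypothetical sketch: probe abstention on a target concept vs. its
# descendants in a small concept taxonomy (child -> parent edges).
from collections import defaultdict

# Toy taxonomy, loosely in the spirit of a knowledge-graph hierarchy.
PARENT = {
    "river": "body of water",
    "stream": "river",
    "tributary": "river",
    "lake": "body of water",
}

def descendants(concept):
    """Return all concepts strictly below `concept` in the taxonomy."""
    children = defaultdict(list)
    for child, parent in PARENT.items():
        children[parent].append(child)
    out, stack = [], [concept]
    while stack:
        for c in children[stack.pop()]:
            out.append(c)
            stack.append(c)
    return out

def abstention_rate(model_abstains, concepts):
    """Fraction of concepts the model abstains on (1.0 = always abstains)."""
    if not concepts:
        return 0.0
    return sum(model_abstains(c) for c in concepts) / len(concepts)

# Stand-in "model" that abstains only on the exact target string,
# illustrating the descendant generalization gap the paper reports.
target = "river"
model_abstains = lambda concept: concept == target

on_target = abstention_rate(model_abstains, [target])              # 1.0
on_descendants = abstention_rate(model_abstains, descendants(target))  # 0.0
```

A real evaluation would replace the oracle with prompted model calls and a refusal classifier; the gap between `on_target` and `on_descendants` is what quantifies generalization, while abstention on unrelated sibling concepts (e.g., "lake") would quantify specificity.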
Problem

Research questions and friction points this paper is trying to address.

Evaluate abstention techniques in language models
Assess generalization and specificity of abstention
Benchmark techniques using knowledge graph concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge graph guides abstention evaluation
Benchmark SELECT assesses model safety
Techniques show generalization-specificity trade-offs
πŸ”Ž Similar Papers
No similar papers found.