Knowledge Graph Guided Evaluation of Abstention Techniques

πŸ“… 2024-12-10
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of evaluating abstention mechanisms in language models. We introduce SELECT, the first knowledge-graph-based benchmark for abstention evaluation, which isolates benign concepts (e.g., "rivers") to eliminate confounding effects from adversarial prompts and systematically assesses the conceptual-level generalization and specificity of abstention methods. Leveraging knowledge graphs to model concept taxonomies, we design ancestor/descendant sampling strategies to probe abstention behavior across hierarchical concept relations. We conduct a cross-model evaluation on six open-weight and closed-source LMs. Results show that while mainstream abstention techniques achieve >80% abstention rates overall, their abstention rate drops by 19% on descendants of the targeted concepts. Crucially, no single method dominates across dimensions: a fundamental trade-off exists between generalization and specificity. This is the first work to quantitatively characterize hierarchical failure modes of abstention mechanisms, establishing a structured, graph-aware evaluation paradigm for trustworthy AI.

πŸ“ Abstract
To deploy language models safely, it is crucial that they abstain from responding to inappropriate requests. Several prior studies test the safety promises of models based on their effectiveness in blocking malicious requests. In this work, we focus on evaluating the underlying techniques that cause models to abstain. We create SELECT, a benchmark derived from a set of benign concepts (e.g., "rivers") from a knowledge graph. Focusing on benign concepts isolates the effect of safety training, and grounding these concepts in a knowledge graph allows us to study the generalization and specificity of abstention techniques. Using SELECT, we benchmark different abstention techniques over six open-weight and closed-source models. We find that the examined techniques indeed cause models to abstain, with over 80% abstention rates. However, these techniques are not as effective for descendants of the target concepts, where abstention rates drop by 19%. We also characterize the generalization-specificity trade-offs for different techniques. Overall, no single technique is invariably better than others, and our findings inform practitioners of the various trade-offs involved.
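The evaluation idea in the abstract — measure abstention on a target concept and separately on its knowledge-graph descendants to expose a generalization gap — can be sketched as follows. This is a minimal, hypothetical illustration: the toy taxonomy, the `model_abstains` oracle, and every function name here are assumptions for exposition, not SELECT's actual construction.

```python
# Hypothetical sketch: probe abstention on a target concept vs. its
# descendants in a small concept taxonomy (child -> parent edges).
from collections import defaultdict

# Toy taxonomy, loosely in the spirit of a knowledge-graph hierarchy.
PARENT = {
    "river": "body of water",
    "stream": "river",
    "tributary": "river",
    "lake": "body of water",
}

def descendants(concept):
    """Return all concepts strictly below `concept` in the taxonomy."""
    children = defaultdict(list)
    for child, parent in PARENT.items():
        children[parent].append(child)
    out, stack = [], [concept]
    while stack:
        for c in children[stack.pop()]:
            out.append(c)
            stack.append(c)
    return out

def abstention_rate(model_abstains, concepts):
    """Fraction of concepts the model abstains on (1.0 = always abstains)."""
    if not concepts:
        return 0.0
    return sum(model_abstains(c) for c in concepts) / len(concepts)

# Stand-in "model" that abstains only on the exact target string,
# illustrating the descendant generalization gap the paper reports.
target = "river"
model_abstains = lambda concept: concept == target

on_target = abstention_rate(model_abstains, [target])              # 1.0
on_descendants = abstention_rate(model_abstains, descendants(target))  # 0.0
```

A real evaluation would replace the oracle with prompted model calls and a refusal classifier; the gap between `on_target` and `on_descendants` is what quantifies generalization, while abstention on unrelated sibling concepts (e.g., "lake") would quantify specificity.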
Problem

Research questions and friction points this paper is trying to address.

Evaluate abstention techniques in language models
Assess generalization and specificity of abstention
Benchmark techniques using knowledge graph concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge graph guides abstention evaluation
Benchmark SELECT assesses model safety
Techniques show generalization-specificity trade-offs
πŸ”Ž Similar Papers
No similar papers found.