🤖 AI Summary
This paper addresses the challenge of undefined behavioral boundaries in large language models (LLMs) during real-world deployment. It proposes the "model scoping" paradigm: constraining an LLM to respond exclusively to domain-specific queries (e.g., document QA, programming assistance) while proactively refusing out-of-scope requests (e.g., poetry generation, physics Q&A). Methodologically, it introduces a hierarchical alignment strategy that layers supervised fine-tuning, preference learning, prompt engineering, and plug-and-play Circuit Breakers (CBs), significantly improving refusal robustness when the diversity of irrelevant training queries is low. To the authors' knowledge, this is the first work to formally define, systematically evaluate, and empirically validate model scoping. Extensive experiments across three major LLM families, multiple downstream tasks, and adversarial benchmarks demonstrate its efficacy: fine-tuning with diverse examples achieves >92% refusal accuracy; under data scarcity, CBs alone improve refusal rates by 37%; and layered combinations deliver both high precision and strong generalization.
📝 Abstract
We now deploy language models in a wide variety of user-facing applications. Typically, these deployments have some specific purpose, like answering questions about documentation or acting as coding assistants, but they require general language understanding. Under these circumstances, these models should not answer irrelevant requests, such as poetry generation or questions about physics. Instead, we would like language models to answer only queries corresponding to the desired behavior and refuse all other requests, which we refer to as scoping. We conduct a comprehensive empirical evaluation of potential methods, from prompting to fine-tuning to preference learning to a recently proposed method for general alignment called Circuit Breakers (CB). Across three families of language models and a broad variety of tasks, we show that it is possible to scope language models. We examine scoping for multiple topics as well as fine-grained topics. We ablate the diversity of irrelevant queries, layer different techniques, conduct adversarial evaluations, and more. Among other results, we find that when diverse examples of irrelevant queries are available, simple supervised fine-tuning produces the best results, but when such diversity is low, Circuit Breakers perform quite well. One can often get the benefits of both methods by layering them in succession. We intend our study to serve as a practitioner's guide to scoping language models.
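As a concrete illustration of the supervised fine-tuning recipe sketched in the abstract, one can pair in-scope queries with their desired answers and out-of-scope queries with a fixed refusal. This is a minimal sketch under our own assumptions; the refusal string, queries, and helper name are hypothetical and not taken from the paper:

```python
# Minimal sketch of assembling a scoping dataset for supervised fine-tuning.
# REFUSAL, the example queries, and build_scoping_dataset are illustrative
# placeholders, not the paper's actual data or code.

REFUSAL = "I'm sorry, but that request is outside the scope of this assistant."

def build_scoping_dataset(in_scope, out_of_scope, refusal=REFUSAL):
    """Pair in-scope queries with their answers, and out-of-scope
    queries with a fixed refusal, yielding (prompt, target) examples."""
    examples = [(query, answer) for query, answer in in_scope]
    examples += [(query, refusal) for query in out_of_scope]
    return examples

# In-scope: coding-assistant queries with desired answers.
in_scope = [
    ("How do I reverse a list in Python?", "Use reversed(xs) or xs[::-1]."),
]
# Out-of-scope: irrelevant queries the scoped model should refuse.
out_of_scope = [
    "Write me a poem about the sea.",
    "Explain quantum entanglement.",
]

dataset = build_scoping_dataset(in_scope, out_of_scope)
```

The resulting (prompt, target) pairs can then be fed to any standard instruction-tuning pipeline; the abstract's diversity ablation corresponds to varying how broad the `out_of_scope` pool is.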