🤖 AI Summary
To address security vulnerabilities in large language models (LLMs) during in-context learning—specifically their susceptibility to adversarial or erroneous demonstration examples—this paper proposes a Distribution-Free Risk Control (DFRC) framework. DFRC integrates dynamic early-exit prediction with selective attention head masking, enabling real-time identification and suppression of harmful context relative to a zero-shot safety baseline. Its key innovation lies in the first application of distribution-free risk control to in-context learning: it requires no assumptions about the attack distribution while jointly optimizing safety and inference efficiency. Experiments demonstrate that DFRC robustly mitigates diverse malicious-example attacks, substantially alleviates the resulting accuracy degradation, accelerates inference by up to 32% on benign inputs, and improves average accuracy by 1.8%.
📝 Abstract
Large language models (LLMs) demonstrate a remarkable ability to learn new tasks from a few in-context examples. However, this flexibility introduces safety concerns: LLMs can be influenced by incorrect or malicious demonstrations -- for example, if an adversary tampers with or injects harmful examples without a human supervisor noticing. This motivates principled designs in which the system itself includes built-in mechanisms to guard against such attacks. We propose a novel approach to limit the degree to which harmful demonstrations can degrade model performance. First, we define a baseline "safe" behavior for the model -- the model's performance given no in-context demonstrations (zero-shot). Next, we apply distribution-free risk control (DFRC) to bound the extent to which in-context samples can degrade performance below this zero-shot baseline. We achieve this by leveraging dynamic early-exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs *and* leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results showing that our approach can effectively control risk for harmful in-context demonstrations while simultaneously achieving substantial computational efficiency gains with helpful demonstrations.
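To make the distribution-free risk control step concrete, here is a minimal sketch of the generic calibration idea it builds on: pick the most permissive setting of a control knob (e.g., how aggressively to exit early or mask attention heads) whose risk, upper-bounded on a held-out calibration set, stays below a target level. This is in the spirit of Learn-Then-Test-style fixed-sequence testing with a Hoeffding bound; the threshold names and the loss construction below are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of distribution-free risk control calibration.
# Assumptions (not from the paper): thresholds are ordered from most to
# least conservative, losses are 0/1, and risk is non-decreasing along
# the sequence, so fixed-sequence testing stops at the first failure.
import math

def select_threshold(candidate_thresholds, calibration_losses, alpha, delta):
    """Return the most permissive threshold whose risk bound stays <= alpha.

    candidate_thresholds: thresholds ordered most -> least conservative.
    calibration_losses: dict mapping threshold -> list of 0/1 losses
        observed on a held-out calibration set.
    alpha: target risk level (e.g., tolerated drop below zero-shot accuracy).
    delta: allowed failure probability of the guarantee.
    """
    chosen = None
    for lam in candidate_thresholds:
        losses = calibration_losses[lam]
        n = len(losses)
        mean = sum(losses) / n
        # Hoeffding upper confidence bound on the true (unknown) risk:
        # holds with probability >= 1 - delta for bounded losses in [0, 1].
        ucb = mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        if ucb <= alpha:
            chosen = lam  # still certifiably safe; try a looser threshold
        else:
            break  # fixed-sequence testing: stop at the first violation
    return chosen
```

The key property is distribution-freeness: the guarantee relies only on the calibration losses being exchangeable with test-time losses, not on any model of the attack distribution, which matches the abstract's claim of controlling risk without assumptions about how demonstrations were corrupted.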