🤖 AI Summary
To address security vulnerabilities in large language models (LLMs) during in-context learning—specifically their susceptibility to adversarial or erroneous demonstration examples—this paper proposes a Distribution-Free Risk Control (DFRC) framework. DFRC integrates dynamic early-exit prediction with selective attention head masking, enabling real-time identification and suppression of harmful context relative to a zero-shot safety baseline. Its key innovation lies in the first application of distribution-free risk control to in-context learning: it requires no assumptions about the attack distribution while jointly optimizing safety and inference efficiency. Experiments demonstrate that DFRC robustly mitigates diverse malicious-example attacks, substantially alleviates the resulting accuracy degradation, accelerates inference by up to 32% on benign inputs, and improves average accuracy by 1.8%.
📝 Abstract
Large language models (LLMs) demonstrate a remarkable ability to learn new tasks from a few in-context examples. However, this flexibility introduces safety concerns: LLMs can be influenced by incorrect or malicious demonstrations -- for example, if an adversary tampers with or injects harmful examples without a human supervisor noticing. This motivates principled designs in which the system itself includes built-in mechanisms to guard against such attacks. We propose a novel approach to limit the degree to which harmful demonstrations can degrade model performance. First, we define a baseline "safe" behavior for the model -- the model's performance given no in-context demonstrations (zero-shot). Next, we apply distribution-free risk control (DFRC) to bound the extent to which in-context samples can degrade performance below this zero-shot baseline. We achieve this by leveraging dynamic early-exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs *and* leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results showing that our approach can effectively control risk for harmful in-context demonstrations while simultaneously achieving substantial computational efficiency gains with helpful demonstrations.
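To make the distribution-free risk control step concrete, here is a minimal sketch of the generic calibration idea it builds on: pick the most permissive setting of a control knob (e.g., how aggressively to exit early or mask attention heads) whose risk, upper-bounded on a held-out calibration set, stays below a target level. This is in the spirit of Learn-Then-Test-style fixed-sequence testing with a Hoeffding bound; the threshold names and the loss construction below are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of distribution-free risk control calibration.
# Assumptions (not from the paper): thresholds are ordered from most to
# least conservative, losses are 0/1, and risk is non-decreasing along
# the sequence, so fixed-sequence testing stops at the first failure.
import math

def select_threshold(candidate_thresholds, calibration_losses, alpha, delta):
    """Return the most permissive threshold whose risk bound stays <= alpha.

    candidate_thresholds: thresholds ordered most -> least conservative.
    calibration_losses: dict mapping threshold -> list of 0/1 losses
        observed on a held-out calibration set.
    alpha: target risk level (e.g., tolerated drop below zero-shot accuracy).
    delta: allowed failure probability of the guarantee.
    """
    chosen = None
    for lam in candidate_thresholds:
        losses = calibration_losses[lam]
        n = len(losses)
        mean = sum(losses) / n
        # Hoeffding upper confidence bound on the true (unknown) risk:
        # holds with probability >= 1 - delta for bounded losses in [0, 1].
        ucb = mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        if ucb <= alpha:
            chosen = lam  # still certifiably safe; try a looser threshold
        else:
            break  # fixed-sequence testing: stop at the first violation
    return chosen
```

The key property is distribution-freeness: the guarantee relies only on the calibration losses being exchangeable with test-time losses, not on any model of the attack distribution, which matches the abstract's claim of controlling risk without assumptions about how demonstrations were corrupted.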