ContextBench: Modifying Contexts for Targeted Latent Activation

📅 2025-06-15

📈 Citations: 0

✨ Influential: 0

career value

189K/year

🤖 AI Summary

This work addresses the controllable activation of specific latent features or behaviors in language models—i.e., generating semantically coherent and fluent input contexts that efficiently trigger target behaviors. To this end, we introduce ContextBench, the first benchmark to formally define the “context modification” paradigm and establish a dual-objective evaluation framework balancing activation strength and linguistic fluency. Methodologically, we integrate evolutionary prompt optimization (EPO), LLM-assisted search, and diffusion-based inpainting for context editing. Notably, our LLM-augmented EPO and diffusion inpainting variants overcome the longstanding trade-off between activation strength and fluency inherent in prior approaches. Experiments demonstrate state-of-the-art performance on ContextBench, achieving substantial improvements in both latent feature activation success rate and joint optimization of behavioral fidelity and text quality.

Technology Category

Application Category

📝 Abstract

Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench -- a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.

Problem

Research questions and friction points this paper is trying to address.

Identify inputs triggering specific model behaviors

Develop methods for targeted latent feature activation

Balance elicitation strength and linguistic fluency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhancing EPO with LLM assistance

Using diffusion model inpainting

Balancing elicitation and fluency

🔎 Similar Papers

No similar papers found.