🤖 AI Summary
This work addresses the controllable activation of specific latent features or behaviors in language models—i.e., generating semantically coherent and fluent input contexts that efficiently trigger target behaviors. To this end, we introduce ContextBench, the first benchmark to formally define the “context modification” paradigm and establish a dual-objective evaluation framework balancing activation strength and linguistic fluency. Methodologically, we integrate evolutionary prompt optimization (EPO), LLM-assisted search, and diffusion-based inpainting for context editing. Notably, our LLM-augmented EPO and diffusion inpainting variants overcome the longstanding trade-off between activation strength and fluency inherent in prior approaches. Experiments demonstrate state-of-the-art performance on ContextBench, achieving substantial improvements in both latent feature activation success rate and joint optimization of behavioral fidelity and text quality.
📝 Abstract
Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench -- a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.