🤖 AI Summary
Existing approaches to controlling the behavior of large language models often suffer from high training costs, weak controllability via natural language, or insufficient semantic coherence. This work proposes a novel mechanism that treats neuron-activation predicates as actionable sufficient conditions for desired behaviors. By employing an iterative search algorithm that verifies output stability under input perturbations, the method identifies a minimal set of critical neurons responsible for generating a specific behavior, thereby enabling a concise, stable, and interpretable control framework. Experiments on the Gemma-2-2B model using the SST-2 and CounterFact datasets demonstrate that the approach outperforms conventional attribution maps in explanation accuracy and achieves cross-lingual generation control guided by natural language instructions.
📝 Abstract
Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, a case study on controlling cross-lingual output generation validates the practical effectiveness of WASD in steering model behavior.
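The search described in the abstract can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the paper's implementation: `toy_model` plays the role of the LLM's output map, a predicate is a `(neuron index, threshold)` pair meaning "activation ≥ threshold", and minimization is done by greedy deletion, checking after each removal that every perturbed input satisfying the remaining predicates still yields the target token.

```python
import random

def toy_model(acts):
    # Stand-in for the model: the output token depends only on
    # neurons 0 and 1; neuron 2 is irrelevant by construction.
    return "positive" if acts[0] + acts[1] > 1.0 else "negative"

def holds(predicates, acts):
    # A predicate (i, t) holds when activation i is at least t.
    return all(acts[i] >= t for i, t in predicates)

def sufficient(predicates, target, perturbed_inputs):
    # Sufficiency check: every perturbed input that satisfies the
    # predicate set must still produce the target token.
    return all(toy_model(a) == target
               for a in perturbed_inputs if holds(predicates, a))

def minimize(predicates, target, perturbed_inputs):
    # Greedy deletion: drop each predicate whose removal preserves
    # sufficiency, yielding a minimal (not necessarily minimum) set.
    current = list(predicates)
    for p in list(current):
        trial = [q for q in current if q != p]
        if sufficient(trial, target, perturbed_inputs):
            current = trial
    return current

random.seed(0)
perturbed = [[random.random() for _ in range(3)] for _ in range(500)]
# Two hand-picked perturbations that expose predicates 0 and 1 as necessary.
perturbed += [[0.0, 0.7, 0.5], [0.7, 0.0, 0.5]]

candidates = [(0, 0.6), (1, 0.6), (2, 0.1)]  # predicate on neuron 2 is spurious
minimal = minimize(candidates, "positive", perturbed)
print(minimal)  # → [(0, 0.6), (1, 0.6)]
```

The spurious neuron-2 predicate is pruned because the output is already guaranteed without it, while dropping either genuine predicate admits a perturbed input that flips the token; this mirrors how the paper's verification-under-perturbation step separates critical neurons from incidental ones.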