🤖 AI Summary
Existing approaches to controlling the behavior of large language models often suffer from high training costs, weak controllability via natural language, or insufficient semantic coherence. This work proposes a novel mechanism that treats neuron-activation predicates as actionable sufficient conditions for desired behaviors. By employing an iterative search algorithm that verifies output stability under input perturbations, the method identifies a minimal set of critical neurons responsible for generating a specific behavior, thereby enabling a concise, stable, and interpretable control framework. Experiments on the Gemma-2-2B model using the SST-2 and CounterFact datasets demonstrate that the approach outperforms conventional attribution maps in explanation accuracy and achieves cross-lingual generation control guided by natural language instructions.
📝 Abstract
Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, a case study on controlling cross-lingual output generation validates the practical effectiveness of WASD in steering model behavior.
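The search described in the abstract can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the paper's implementation: `toy_model` plays the role of the LLM's output map, a predicate is a `(neuron index, threshold)` pair meaning "activation ≥ threshold", and minimization is done by greedy deletion, checking after each removal that every perturbed input satisfying the remaining predicates still yields the target token.

```python
import random

def toy_model(acts):
    # Stand-in for the model: the output token depends only on
    # neurons 0 and 1; neuron 2 is irrelevant by construction.
    return "positive" if acts[0] + acts[1] > 1.0 else "negative"

def holds(predicates, acts):
    # A predicate (i, t) holds when activation i is at least t.
    return all(acts[i] >= t for i, t in predicates)

def sufficient(predicates, target, perturbed_inputs):
    # Sufficiency check: every perturbed input that satisfies the
    # predicate set must still produce the target token.
    return all(toy_model(a) == target
               for a in perturbed_inputs if holds(predicates, a))

def minimize(predicates, target, perturbed_inputs):
    # Greedy deletion: drop each predicate whose removal preserves
    # sufficiency, yielding a minimal (not necessarily minimum) set.
    current = list(predicates)
    for p in list(current):
        trial = [q for q in current if q != p]
        if sufficient(trial, target, perturbed_inputs):
            current = trial
    return current

random.seed(0)
perturbed = [[random.random() for _ in range(3)] for _ in range(500)]
# Two hand-picked perturbations that expose predicates 0 and 1 as necessary.
perturbed += [[0.0, 0.7, 0.5], [0.7, 0.0, 0.5]]

candidates = [(0, 0.6), (1, 0.6), (2, 0.1)]  # predicate on neuron 2 is spurious
minimal = minimize(candidates, "positive", perturbed)
print(minimal)  # → [(0, 0.6), (1, 0.6)]
```

The spurious neuron-2 predicate is pruned because the output is already guaranteed without it, while dropping either genuine predicate admits a perturbed input that flips the token; this mirrors how the paper's verification-under-perturbation step separates critical neurons from incidental ones.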