WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

📅 2026-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches to controlling the behavior of large language models often suffer from high training costs, weak natural-language controllability, or insufficient semantic coherence. This work proposes a mechanism that treats neuron-activation predicates as actionable sufficient conditions for desired behaviors. An iterative search algorithm that verifies output stability under input perturbations identifies a minimal set of critical neurons responsible for a given behavior, yielding a concise, stable, and interpretable control framework. Experiments on the Gemma-2-2B model with the SST-2 and CounterFact datasets demonstrate that the approach outperforms conventional attribution maps in explanation accuracy and successfully achieves cross-lingual generation control guided by natural-language instructions.

📝 Abstract
Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, a case study on cross-lingual output generation validates the practical effectiveness of WASD in controlling model behavior.
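The abstract outlines the core loop: represent candidate conditions as neuron-activation predicates, then iteratively shrink the set while checking that clamping those neurons still forces the original output under perturbed inputs. Below is a minimal Python sketch of one way such a shrinking search could look; the `Predicate` tuple layout, the `output_is_stable` callable, and the toy example are illustrative assumptions, not the paper's actual interface.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical predicate layout: clamp neuron (layer, index) to a value.
Predicate = Tuple[int, int, float]

def is_sufficient(
    preds: Set[Predicate],
    perturbations: Iterable[str],
    output_is_stable: Callable[[Set[Predicate], str], bool],
) -> bool:
    # A condition set is "sufficient" if clamping exactly these neurons
    # preserves the original output on every perturbed input.
    return all(output_is_stable(preds, x) for x in perturbations)

def minimal_sufficient_set(
    candidates: Set[Predicate],
    perturbations: list,
    output_is_stable: Callable[[Set[Predicate], str], bool],
) -> Set[Predicate]:
    # Greedy shrinking: try dropping each predicate; keep the drop only
    # if the remaining set is still sufficient under all perturbations.
    current = set(candidates)
    for p in sorted(candidates):  # fixed order for reproducibility
        trial = current - {p}
        if trial and is_sufficient(trial, perturbations, output_is_stable):
            current = trial
    return current

if __name__ == "__main__":
    # Toy stand-in for the model: pretend two neurons are truly critical,
    # so any condition set containing both keeps the output stable.
    critical = {(0, 1, 0.9), (2, 7, 1.3)}

    def toy_stable(preds: Set[Predicate], _x: str) -> bool:
        return critical <= preds

    candidates = critical | {(1, 3, 0.2), (4, 0, -0.5)}
    print(minimal_sufficient_set(candidates, ["x'", "x''"], toy_stable))
    # -> {(0, 1, 0.9), (2, 7, 1.3)}
```

Note that greedy deletion yields a minimal set (no single predicate can be dropped) rather than a globally smallest one; the paper's iterative search presumably plays an analogous role against real model activations and perturbation sets.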
Problem

Research questions and friction points this paper is trying to address.

large language models
behavior control
model interpretability
neural conditions
token generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

sufficient conditions
neuron activation
behavior control
LLM interpretability
minimal explanation
Haonan Yu
Research Scientist, Skild AI
Robotics, Deep Reinforcement Learning, Multimodal Learning
Junhao Liu
Ph.D. Candidate, Peking University
Explainable AI, AI Safety, LLM, VLM
Zhenyu Yan
Key Lab of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China
Haoran Lin
Key Lab of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China
Xin Zhang
Peking University
gerontology, ageism