Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of expressing the decision logic of large language models (LLMs) as symbolic rules grounded in their neural mechanisms. The authors propose MechaRule, a pipeline that ties rule extraction to neuron-level mechanistic interpretation. Building on the observation that sparse neuron effects are approximately monotone and saturating within a fixed baseline/flip regime, MechaRule treats localization as adaptive group testing, combined with contrastive hierarchical ablation and rule-aligned data splits (with spectral partitioning as a rule-free fallback), to identify sparse sets of critical neurons, termed agonists, with far fewer interventions than exhaustive search. On arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of the high-effect agonists found by brute-force search in completed comparisons. Suppressing the localized agonists reduces arithmetic accuracy by up to 71.1% and jailbreak success by up to 8.8%, supporting the functional significance of the extracted rules and their neural substrates.

📝 Abstract

A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding Θ(k log(N/k) + k) interventions over N candidates when k ≪ N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.
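
The group-testing view in the abstract can be illustrated with a toy sketch. This is not MechaRule's procedure: it uses plain binary splitting with an idealized, noise-free oracle and omits the regime-conditional strength predicate and confidence-guided pruning. The names `localize_agonists`, `make_oracle`, and `group_flips` are hypothetical, and the oracle stands in for "ablate this group of neurons and check whether the behavior flips". Under the monotone assumption (a group tests positive exactly when it contains at least one agonist), any negative group can be discarded with a single test, so the number of interventions grows roughly as k log N rather than N.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def localize_agonists(
    candidates: Sequence[int],
    group_flips: Callable[[Sequence[int]], bool],
) -> List[int]:
    """Find positive candidates by adaptive binary splitting.

    Assumption (idealized): group_flips(group) is True exactly when
    the group contains at least one agonist.
    """
    found: List[int] = []
    stack: List[List[int]] = [list(candidates)]
    while stack:
        group = stack.pop()
        # One test per group; a negative group is pruned as a whole.
        if not group or not group_flips(group):
            continue
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return found


def make_oracle(
    agonists: Set[int],
) -> Tuple[Callable[[Sequence[int]], bool], Dict[str, int]]:
    """Build a hypothetical noise-free oracle that counts its own calls."""
    calls = {"n": 0}

    def group_flips(group: Sequence[int]) -> bool:
        calls["n"] += 1
        return any(i in agonists for i in group)

    return group_flips, calls


# Toy setup: 1024 candidate neurons, 3 hidden agonists (made-up indices).
oracle, calls = make_oracle({17, 403, 900})
result = localize_agonists(range(1024), oracle)
print(sorted(result), "tests used:", calls["n"])
```

In this toy setting the search needs far fewer oracle calls than testing all 1024 candidates one by one. Real ablation effects are noisy and only approximately monotone, which is why the abstract describes conservative, confidence-guided pruning rather than the hard pruning used here.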

Problem

Research questions and friction points this paper is trying to address.

explainable AI

rule extraction

mechanistic interpretability

large language models

neuron localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

neuron-anchored rule extraction

contrastive hierarchical ablation

mechanistic interpretability