🤖 AI Summary
While fine-tuning MLP layers of large language models (LLMs) encodes rich task-specific features, the underlying neuron collaboration mechanisms remain poorly understood.
Method: We propose the first mechanistic interpretability framework grounded in cooperative game theory: we model neurons as agents in hedonic games whose preferences capture their synergistic contributions, and introduce stable coalition detection to identify non-additive neuronal collaborations and reveal their dynamic evolution across layers. Our approach integrates LoRA update analysis, top-responsive utility computation, the PAC-Top-Cover algorithm for stable coalition extraction, and inter-layer coalition tracking.
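The first step, LoRA update analysis, locates where fine-tuning concentrates new features. A minimal sketch of the idea, where the shapes and the per-neuron row-norm proxy are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def neuron_update_norms(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Per-neuron magnitude of the LoRA update Delta_W = B @ A.

    A: (r, d_in) down-projection, B: (d_out, r) up-projection.
    Row i of Delta_W is the weight update feeding output neuron i,
    so its L2 norm is a rough proxy for how much fine-tuning
    changed that neuron.
    """
    delta_w = B @ A                       # (d_out, d_in) low-rank update
    return np.linalg.norm(delta_w, axis=1)

# Toy example: rank-4 LoRA update on a 16 -> 32 MLP projection.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 16))
B = rng.normal(size=(32, 4))
norms = neuron_update_norms(A, B)
top_neurons = np.argsort(norms)[::-1][:5]  # most-changed neurons
```

Ranking layers by the mass of these norms is one plausible way to see updates concentrating in mid-layer MLPs, as the abstract reports.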
Results: On LLaMA, Mistral, and Pythia reranking models, the discovered coalitions exhibit significantly stronger collaboration than clustering-based baselines, demonstrating both functional importance and cross-domain predictability. This work establishes a novel paradigm for understanding high-order representations in MLP layers.
📝 Abstract
Fine-tuned Large Language Models (LLMs) encode rich task-specific features, but the form of these representations, especially within MLP layers, remains unclear. Empirical inspection of LoRA updates shows that new features concentrate in mid-layer MLPs, yet the scale of these layers obscures meaningful structure. Prior probing suggests that statistical priors may strengthen, split, or vanish across depth, motivating the need to study how neurons work together rather than in isolation.
We introduce a mechanistic interpretability framework based on coalitional game theory, in which neurons act as agents in a hedonic game whose preferences capture their synergistic contributions to layer-local computations. Using top-responsive utilities and the PAC-Top-Cover algorithm, we extract stable coalitions of neurons: groups whose joint ablation has non-additive effects. We then track their transitions across layers as persistence, splitting, merging, or disappearance.
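Cross-layer tracking can be sketched as overlap-based matching between coalitions in adjacent layers. The Jaccard threshold and the decision rules below are illustrative assumptions, not the paper's definitions:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two neuron sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def classify_transition(coalition, next_coalitions, tau=0.5):
    """Label a coalition's fate in the next layer via member overlap.

    tau is an illustrative threshold, not a value from the paper.
    """
    overlaps = {i: len(coalition & c) / len(coalition)
                for i, c in enumerate(next_coalitions) if coalition & c}
    if not overlaps:
        return "disappear"          # no members recur in any coalition
    best = max(overlaps, key=overlaps.get)
    if overlaps[best] > tau:
        # most members stay together: same group, or absorbed into a bigger one
        return ("persist" if jaccard(coalition, next_coalitions[best]) >= tau
                else "merge")
    return "split"                  # members dispersed across coalitions
```

For example, `classify_transition({1, 2, 3}, [{1, 2, 3, 4, 5, 6, 7, 8}])` returns `"merge"`: all members recur, but inside a much larger group.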
Applied to LLaMA, Mistral, and Pythia rerankers fine-tuned on scalar IR tasks, our method finds coalitions with consistently higher synergy than clustering baselines. By revealing how neurons cooperate to encode features, hedonic coalitions uncover higher-order structure beyond disentanglement and yield computational units that are functionally important, interpretable, and predictive across domains.
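The synergy comparison above can be sketched as non-additivity under ablation: the performance drop from ablating a coalition jointly, minus the summed drops from ablating each neuron alone. The `score` interface is a hypothetical stand-in for the reranker's task metric, not the paper's exact measure:

```python
def ablation_synergy(score, coalition):
    """Synergy of a neuron coalition under ablation.

    score(ablated) -> task metric with the given neuron set zeroed out.
    Returns the joint-ablation drop minus the sum of individual drops;
    a positive value means the coalition's effect is non-additive.
    """
    base = score(frozenset())
    joint_drop = base - score(frozenset(coalition))
    individual = sum(base - score(frozenset({n})) for n in coalition)
    return joint_drop - individual

# Toy score: neurons 0 and 1 matter only together; neuron 2 acts alone.
def toy_score(ablated):
    s = 10.0
    if {0, 1} <= ablated:
        s -= 3.0
    if 2 in ablated:
        s -= 0.5
    return s

synergistic = ablation_synergy(toy_score, {0, 1})  # 3.0: non-additive pair
additive = ablation_synergy(toy_score, {2})        # 0.0: acts alone
```

Under a measure like this, a clustering baseline that groups neurons by activation similarity need not find high-synergy sets, which is the gap the stable coalitions are reported to close.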