AI Summary
This work addresses the significant inconsistency in large language models' ability to recall the same factual knowledge across different languages, which complicates efforts to locate the internal modules responsible for specific knowledge. The study introduces a scalable knowledge localization framework that leverages these cross-lingual inconsistencies as an interpretability tool. By comparing expert activation patterns between successful and failed multilingual responses in a Mixture-of-Experts architecture, and by statistically analyzing the routing logs, the method identifies the key experts associated with a particular piece of knowledge. Experimental results show that deactivating only about 20 of 6,000 experts causes the model to fail on over 40% of the targeted knowledge questions, confirming both the effectiveness of the approach and the necessity of the identified experts for accurate knowledge recall.
Abstract
Modern LLMs continue to exhibit significant variance in behavior across languages, such as being able to recall factual information in some languages but not others. While this is typically studied as a problem to be mitigated, in this work we propose leveraging cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts routing patterns between languages in which the model correctly recalls a piece of information and languages in which it fails. This allows us to isolate model components that play a functional role in recalling that piece of knowledge. Our method proceeds in two stages: (1) querying the model with difficult factual questions across a diverse set of languages to generate "success" and "failure" activation buckets, and (2) applying a statistical contrastive analysis to the MoE router logits to identify the experts important for that knowledge. To validate that this small set of experts is necessary for answering a knowledge question, we deactivate them and re-ask the question. We find that despite deactivating only about 20 out of 6,000 experts, the model no longer answers correctly in over 40% of cases. Overall, this method provides a realistic and scalable knowledge localization approach for increasingly complex LLMs.
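To make the second stage concrete, the contrastive analysis over router activations can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes per-query expert activation logs have already been collected into binary success/failure matrices, and it ranks experts with a simple two-proportion z-statistic (the paper's exact statistic may differ).

```python
import numpy as np

def score_experts(success_acts, failure_acts):
    """Rank experts by how much more often they are routed to during
    successful recalls than during failed ones.

    success_acts, failure_acts: binary arrays of shape
    (num_queries, num_experts), where entry [q, e] = 1 if expert e
    was activated while answering query q.
    Returns expert indices, most success-associated first.
    """
    n_s, n_f = len(success_acts), len(failure_acts)
    p_succ = success_acts.mean(axis=0)   # per-expert activation rate (success)
    p_fail = failure_acts.mean(axis=0)   # per-expert activation rate (failure)
    # Pooled two-proportion z-statistic per expert
    p_pool = (success_acts.sum(axis=0) + failure_acts.sum(axis=0)) / (n_s + n_f)
    var = p_pool * (1.0 - p_pool) * (1.0 / n_s + 1.0 / n_f)
    z = (p_succ - p_fail) / np.sqrt(np.clip(var, 1e-12, None))
    return np.argsort(z)[::-1]
```

The top-ranked experts would then be the candidates to deactivate (e.g. by masking their router logits) when re-asking the question in the validation step.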