🤖 AI Summary
Conventional correlation-based feature-attribution methods generalize poorly when predicting large language models' (LLMs) out-of-distribution (OOD) behavior. Method: This work applies internal causal analysis directly to OOD behavior prediction, proposing two causal approaches: counterfactual simulation and neuron-level value probing. Using causal interventions, counterfactual generation, and AUC-ROC evaluation, the authors assess how well causal features predict output correctness across symbolic-manipulation, knowledge-retrieval, and instruction-following tasks. Contribution/Results: Both methods achieve high in-distribution AUC, significantly surpassing non-causal baselines, and remain substantially more predictive under OOD conditions. These results empirically validate that internal causal mechanisms provide robust, generalizable signals for forecasting LLM behavior beyond the training distribution.
📝 Abstract
Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks, including symbol manipulation, knowledge retrieval, and instruction following, we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
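To make the value-probing idea concrete, the sketch below is a minimal, illustrative stand-in (not the paper's implementation): it trains a logistic-regression probe on synthetic "hidden activations" to predict answer correctness and scores it with AUC-ROC. The data, dimensions, and training loop are all assumptions made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: runs where the model answers
# correctly (y = 1) have a shifted activation mean, mimicking a causal
# variable leaking into the representation.
d, n = 16, 200
y = rng.integers(0, 2, n)                       # 1 = model output was correct
X = rng.normal(size=(n, d)) + 0.8 * y[:, None]  # hypothetical activations

# Logistic-regression probe fit by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(correct)
    g = p - y                               # gradient of log-loss wrt logits
    w -= 0.1 * (X.T @ g) / n
    b -= 0.1 * g.mean()

def auc_roc(scores, labels):
    """AUC-ROC via the rank-sum (Mann-Whitney U) formulation."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(f"probe AUC-ROC: {auc_roc(X @ w + b, y):.3f}")
```

In the paper's setting, `X` would instead hold activations of causally identified neurons, and the OOD test is whether a probe fit in distribution still ranks correct outputs above incorrect ones on shifted inputs.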