🤖 AI Summary
This work tackles the unpredictable and hard-to-interpret cross-domain transfer behavior of large language models after post-training. To this end, the authors propose the SAE-based Transferability Score (STS), which for the first time leverages sparse autoencoders (SAEs) to extract interpretable representational features: by analyzing the dimensional shifts induced by post-training and their correlation with the target domain, STS predicts transfer performance without any actual fine-tuning. The method is validated across multiple models and domains, achieving Pearson correlation coefficients above 0.7 between STS scores and actual performance changes. The approach is also preliminarily extended to reinforcement learning, establishing a new paradigm for efficient and interpretable transfer evaluation.
📝 Abstract
In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability *before* fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at https://github.com/PKU-ML/STS.
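To make the idea concrete, here is a minimal, illustrative sketch of an STS-style computation. It is not the paper's implementation (see the linked repository for that): the function name `sts_score`, the `top_k` selection, and the use of mean SAE feature activations as a domain profile are all assumptions made for illustration. The sketch follows the two steps the abstract describes: find the SAE dimensions most shifted by post-training, then correlate that shift with the target domain's feature profile.

```python
import numpy as np

def sts_score(acts_base, acts_post, acts_target, top_k=5):
    """Toy STS-style transferability score (illustrative only).

    acts_base, acts_post: (n_samples, n_features) SAE feature activations
        on the post-training data, before and after fine-tuning.
    acts_target: (m_samples, n_features) SAE activations on target-domain data.
    Returns a Pearson-style correlation in roughly [-1, 1]: positive means
    the post-training shift aligns with features active in the target domain.
    """
    # Step 1: dimensional shift induced by post-training.
    shift = acts_post.mean(axis=0) - acts_base.mean(axis=0)
    # Step 2: keep only the most strongly shifted SAE dimensions.
    top = np.argsort(np.abs(shift))[-top_k:]
    # Step 3: correlate the shift with the target domain's activation profile.
    target_profile = acts_target.mean(axis=0)
    s, t = shift[top], target_profile[top]
    s = (s - s.mean()) / (s.std() + 1e-8)
    t = (t - t.mean()) / (t.std() + 1e-8)
    return float(np.mean(s * t))
```

Under this toy construction, a fine-tuning shift that amplifies the same SAE features the target domain relies on yields a score near +1 (transfer expected to help), while an opposing shift yields a score near -1.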