Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models

📅 2026-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work reveals that large language models (LLMs) exhibit systematic unfaithfulness when generating explanations of their internal representations using Patchscopes, primarily due to interference from inherent linguistic biases that cause outputs to deviate from contextual relevance. To address this issue, the authors propose Bias Alignment through Logit Recalibration (BALOR), a method that recalibrates the generation distribution by comparing output logits under patched and unpatched conditions, thereby suppressing bias and amplifying contextual signals. The study introduces a dedicated evaluation dataset and demonstrates that standard Patchscope-based explanations suffer an average 18.84% drop in faithfulness. In contrast, BALOR consistently outperforms baseline approaches across multiple mainstream LLMs, achieving up to a 33% relative improvement in contextual consistency and significantly enhancing explanation fidelity.

📝 Abstract
Large Language Models (LLMs) have demonstrated strong capabilities for hidden representation interpretation through Patchscopes, a framework that uses LLMs themselves to generate human-readable explanations by decoding from internal hidden representations. However, our work shows that LLMs tend to rely on inherent linguistic patterns, which can override contextual information encoded in the hidden representations during decoding. For example, even when a hidden representation encodes the contextual attribute "purple" for "broccoli", LLMs still generate "green" in their explanations, reflecting a strong prior association. This behavior reveals a systematic unfaithfulness in Patchscopes. To systematically study this issue, we first designed a dataset to evaluate the faithfulness of Patchscopes under biased cases, and our results show that there is an 18.84% faithfulness decrease on average. We then propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as capturing model bias and contrasts them with logits obtained under patched contextual information. By recalibrating the logit distribution through this contrast, BALOR suppresses model bias and amplifies contextual information during generation. Experiments across multiple LLMs demonstrate that BALOR consistently outperforms existing baselines, achieving up to 33% relative performance improvement.
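The recalibration described in the abstract — contrasting logits from a patched prompt against logits from an unpatched (bias-only) prompt — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the function name `balor_recalibrate`, the contrast-strength parameter `alpha`, and the toy logit values are all assumptions for illustration.

```python
import numpy as np

def balor_recalibrate(patched_logits, unpatched_logits, alpha=1.0):
    """Contrast logits from the patched prompt against logits from the
    unpatched prompt; the difference isolates the contextual signal and
    pushes the distribution away from the model's prior bias.
    `alpha` (a hypothetical knob) scales the contrast strength."""
    contrast = patched_logits - unpatched_logits
    return patched_logits + alpha * contrast

# Toy vocabulary: index 0 = "green" (prior bias), index 1 = "purple" (context).
unpatched = np.array([3.0, 0.0])  # bias-only prompt strongly prefers "green"
patched = np.array([2.0, 1.5])    # patching helps, but "green" still wins

recal = balor_recalibrate(patched, unpatched, alpha=1.0)
print(int(patched.argmax()))  # 0 -> plain decoding still emits the biased token
print(int(recal.argmax()))    # 1 -> recalibrated logits favor the contextual token
```

In this toy example the biased token would win under the patched logits alone, mirroring the "broccoli"/"green" failure case, while the contrastive correction flips the prediction to the contextual token.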
Problem

Research questions and friction points this paper is trying to address.

model bias
faithfulness
hidden representations
large language models
explanation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Patchscopes
model bias
logit recalibration
faithfulness
hidden representations