🤖 AI Summary
Evaluating the faithfulness of self-generated natural language explanations (self-NLEs) from large language models (LLMs), i.e., how well they align with the model's actual internal reasoning, remains challenging: existing methods rely on behavioral proxies or coarse-grained module attribution, which lack grounding in neural dynamics.
Method: We propose the first quantitative faithfulness evaluation framework that directly uses latent state dynamics as a ground-truth benchmark. Unlike prior approaches, it enables differentiable cross-modal alignment between textual explanations and neural activations (hidden-layer representations), integrating attention- and gradient-based attribution for fine-grained, interpretable quantification.
Contribution/Results: Validated across multiple LLMs and reasoning tasks, our framework significantly improves faithfulness discrimination over baselines. It provides both an optimization target for generating high-fidelity self-NLEs and an interpretable diagnostic tool for explanation quality, advancing principled evaluation of model introspection.
📝 Abstract
Large Language Models (LLMs) have demonstrated the capability of generating free-text self natural language explanations (self-NLEs) to justify their answers. Despite their logical appearance, self-NLEs do not necessarily reflect the LLM's actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model's reasoning. This work introduces a novel, flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLEs by directly comparing them with interpretations of the model's internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLEs and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLEs.
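To make the core idea concrete, here is a minimal toy sketch (not the paper's actual method) of scoring faithfulness by comparing a self-NLE against natural-language interpretations of the model's hidden states. The bag-of-words "embedding", the per-layer interpretation strings, and all function names are illustrative assumptions standing in for a real encoder and a real hidden-state interpretation pipeline:

```python
# Hypothetical sketch: score self-NLE faithfulness as the mean similarity
# between the explanation text and textual interpretations of hidden states.
# All names and representations here are illustrative, not the paper's API.
from collections import Counter
import math

def bow_vector(text):
    """Toy embedding: bag-of-words token counts (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def faithfulness_score(self_nle, hidden_state_interpretations):
    """Align the explanation with per-layer interpretations and aggregate.

    hidden_state_interpretations: list of strings, one per probed layer,
    e.g. produced by decoding hidden states into natural language.
    """
    nle_vec = bow_vector(self_nle)
    sims = [cosine_similarity(nle_vec, bow_vector(interp))
            for interp in hidden_state_interpretations]
    return sum(sims) / len(sims)  # mean alignment across probed layers

# Toy usage with made-up layer interpretations:
score = faithfulness_score(
    "The answer is Paris because the capital of France is Paris",
    ["layer mentions France capital Paris", "layer encodes a geography fact"],
)
print(f"{score:.3f}")
```

In a real instantiation, the bag-of-words encoder would be replaced by a learned text embedder and the interpretation strings by decoded hidden-layer representations, with attribution weights determining how layers are aggregated.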