🤖 AI Summary
Large language models (LLMs) frequently generate explanations that are not faithful to the factors driving their predictions. In high-stakes domains such as healthcare, unfaithful explanations may omit critical clinical cues or conceal spurious shortcuts, undermining trustworthiness and safety.
Method: This work systematically investigates deployable, inference-time levers that affect explanation faithfulness, focusing on three controllable factors: the quantity and quality of few-shot examples, prompt design, and instruction fine-tuning. Empirical evaluation is conducted on the BBQ (social bias) and MedQA benchmarks using GPT-4.1-mini and LLaMA-70B/8B.
Contribution/Results: We demonstrate that carefully curated few-shot examples, structured prompt design, and domain-targeted instruction fine-tuning significantly improve both explanation faithfulness and decision reliability. To our knowledge, this is the first study to quantitatively demonstrate that explanation faithfulness in sensitive domains can be shaped by inference-stage interventions. Our findings yield reproducible, production-ready optimization strategies for building trustworthy, controllable AI decision-support systems in clinical settings.
📝 Abstract
Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference- and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets: BBQ (social bias) and MedQA (medical licensing questions). We manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
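To make the few-shot manipulation described in the abstract concrete, the sketch below assembles prompts with a controlled number of in-context examples, each paired with an explanation. The example pool, question texts, and template are hypothetical placeholders, not the paper's actual materials.

```python
# Hypothetical sketch: building prompts with a controlled number of
# few-shot examples, the kind of inference-time lever the study varies.
# The pool entries and template below are illustrative placeholders.

FEWSHOT_POOL = [
    {"question": "Q1 ...", "answer": "A", "explanation": "Because ..."},
    {"question": "Q2 ...", "answer": "B", "explanation": "Because ..."},
    {"question": "Q3 ...", "answer": "C", "explanation": "Because ..."},
]

def build_prompt(target_question: str, n_shots: int) -> str:
    """Return a prompt with n_shots worked examples (answer + explanation)
    followed by the target question awaiting the model's answer."""
    parts = []
    for ex in FEWSHOT_POOL[:n_shots]:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}\n"
            f"Explanation: {ex['explanation']}\n"
        )
    parts.append(f"Question: {target_question}\nAnswer:")
    return "\n".join(parts)
```

Sweeping `n_shots` (and swapping out which examples populate the pool) is how one would probe the quantity and quality effects reported in the study; the resulting prompts would then be scored with a faithfulness metric of choice.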