🤖 AI Summary
Existing methods for knowledge-intensive tasks overlook variations in the credibility of context documents, which can lead to error propagation. This paper proposes CrEst, a weakly supervised framework that estimates context credibility automatically, without human annotations. CrEst models cross-document semantic consistency to quantify credibility, designs a dual-path integration mechanism with black-box and white-box components to accommodate large language models (LLMs) with different levels of internal access, and improves robustness via attention intervention and consistency-aware aggregation. Evaluated across three model architectures and five benchmark datasets, CrEst achieves up to a 26.86% absolute accuracy gain and a 3.49% F1-score improvement over strong baselines. Notably, it maintains stable performance under high-noise conditions, demonstrating reliable and generalizable credibility-aware reasoning.
📝 Abstract
The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference, without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement. To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.
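The core insight above, that a credible document tends to agree semantically with the other retrieved documents, can be sketched with a minimal toy implementation. This is not the paper's actual method: it substitutes bag-of-words vectors for learned semantic embeddings and uses a plain average of pairwise cosine similarities as the credibility score, purely to illustrate the inter-document agreement idea.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; CrEst presumably uses learned semantic embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def credibility_scores(docs):
    """Score each document by its average similarity to the other documents.

    Documents that agree with the rest of the set score high; outliers
    (e.g. noisy or unreliable retrievals) score low.
    """
    vecs = [embed(d) for d in docs]
    scores = []
    for i, v in enumerate(vecs):
        others = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        scores.append(sum(others) / len(others))
    return scores
```

For example, given two documents that agree and one unrelated outlier, the outlier receives the lowest credibility score; in a black-box integration, such scores could be surfaced to the model as part of the prompt, while a white-box variant would instead down-weight the outlier's tokens in attention.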