🤖 AI Summary
Large language models (LLMs) frequently generate factually inaccurate content, known as hallucinations, and existing mitigation approaches often rely on external knowledge sources. This paper proposes a knowledge-free, fine-grained cross-model consistency framework that detects and corrects hallucinations without external retrieval. It uses semantically equivalent prompts to elicit responses from multiple black-box LLMs, identifies erroneous spans by analyzing discrepancies among the models' outputs, and applies targeted revisions that preserve correct content. Crucially, the method decouples hallucination detection from correction at a fine-grained semantic level, enabling precise factual fixes with minimal disturbance to accurate text. On the FELM dataset, the approach improves hallucination detection F1 scores by 6-39%; on the GPQA-diamond benchmark, it raises answer accuracy by 7-8 absolute percentage points for state-of-the-art models including Llama 4 Maverick and Claude 4 Sonnet. These results demonstrate both effectiveness and strong generalization across diverse LLMs.
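To make the detection step concrete, here is a minimal sketch of a cross-model consistency check under stated assumptions: the function names (`detect_inconsistent_spans`, `span_agreement`), the sentence-level span splitting, and the use of a `difflib` string-similarity ratio as a stand-in for a proper semantic-similarity or entailment model are all illustrative choices, not the paper's actual implementation.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def split_into_spans(text: str) -> List[str]:
    """Split a response into sentence-level spans (naive period split)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def span_agreement(span: str, other_response: str) -> float:
    """Score how well a span is corroborated by another model's response."""
    # Stand-in for a semantic-similarity / entailment scorer.
    return max(
        (SequenceMatcher(None, span.lower(), other.lower()).ratio()
         for other in split_into_spans(other_response)),
        default=0.0,
    )


def detect_inconsistent_spans(
    prompt_variants: List[str],
    models: List[Callable[[str], str]],
    threshold: float = 0.5,
) -> List[str]:
    """Flag spans of the primary model's answer that peer models do not support."""
    # Each black-box model answers its own semantically equivalent prompt variant.
    responses = [model(p) for model, p in zip(models, prompt_variants)]
    primary, peers = responses[0], responses[1:]

    flagged = []
    for span in split_into_spans(primary):
        support = max((span_agreement(span, peer) for peer in peers), default=0.0)
        if support < threshold:
            flagged.append(span)  # low cross-model agreement -> likely hallucination
    return flagged
```

In practice the `models` argument would wrap calls to different hosted LLMs, and the similarity scorer would be replaced by a stronger semantic comparison; the structure of the check, however, is the point of the sketch.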
📝 Abstract
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations--generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages FINe-grained Cross-model consistency to detect and mitigate Hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39% compared to existing approaches. For mitigation, Finch-Zk achieves a 7-8 absolute percentage point improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation across multiple models demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
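The second innovation, targeted mitigation, can likewise be sketched in a few lines. The prompt wording and the helper names (`build_revision_prompt`, `revise_flagged_spans`) below are assumptions for illustration, not Finch-Zk's actual prompts; the point is that only the flagged segments are revised while the rest of the answer is kept verbatim.

```python
from typing import Callable, List


def build_revision_prompt(question: str, answer: str, flagged: List[str]) -> str:
    """Ask a model to fix only the disputed segments of an existing answer."""
    disputed = "\n".join(f"- {span}" for span in flagged)
    return (
        f"Question:\n{question}\n\n"
        f"Draft answer:\n{answer}\n\n"
        "The following segments of the draft may be factually incorrect:\n"
        f"{disputed}\n\n"
        "Rewrite the draft, correcting only the listed segments and leaving "
        "all other content unchanged."
    )


def revise_flagged_spans(
    question: str,
    answer: str,
    flagged: List[str],
    model: Callable[[str], str],
) -> str:
    """Return the answer untouched if nothing was flagged, else a targeted revision."""
    if not flagged:
        return answer  # no detected hallucinations -> leave the output alone
    return model(build_revision_prompt(question, answer, flagged))
```

Keeping correction conditional on detection is what makes the intervention low-disturbance: answers with no cross-model disagreement pass through unchanged.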