AI Summary
Large language models (LLMs) frequently generate factually inconsistent outputs, so-called "factual hallucinations." Existing multi-sample self-consistency approaches suffer from high latency and fail to correct high-confidence errors. To address this, we propose an online factual monitoring and tree-based intervention mechanism operating during autoregressive decoding: at each token step, it dynamically assesses the factual plausibility of partial generations, identifies high-risk factual keywords, and triggers localized re-decoding to rectify inconsistencies. Crucially, our method abandons the costly full-sequence resampling paradigm, enabling real-time, fine-grained, low-overhead factual calibration *during* generation. Evaluated across multiple factual consistency benchmarks, our approach achieves significantly higher factual accuracy than self-consistency baselines while reducing inference latency by over 40% and substantially lowering computational overhead.
Abstract
While large language models have demonstrated exceptional performance across a wide range of tasks, they remain susceptible to hallucinations -- generating plausible yet factually incorrect content. Existing methods for mitigating this risk often rely on sampling multiple full-length generations, which introduces significant response latency and becomes ineffective when the model consistently produces hallucinated outputs with high confidence. To address these limitations, we introduce Monitoring Decoding (MD), a novel framework that dynamically monitors the generation process and selectively applies in-process interventions, focusing on revising the crucial tokens responsible for hallucinations. Instead of waiting for multiple full-length generations to complete, we identify hallucination-prone tokens during generation using a monitor function, and further refine these tokens through a tree-based decoding strategy. This approach yields enhanced factual accuracy and coherence in the generated output while maintaining efficiency. Experimental results demonstrate that MD consistently outperforms self-consistency-based approaches in both effectiveness and efficiency, achieving higher factual accuracy while significantly reducing computational overhead.
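The monitor-then-intervene loop described in the abstract can be sketched in toy form. Everything here is an illustrative assumption, not the paper's actual implementation: `toy_model` stands in for an LLM's next-token distribution, the "knowledge base" whitelist stands in for a real factuality signal, and the monitor simply down-weights tokens outside it. The structure mirrors the described flow: propose a token greedily, score it with a monitor, and on low scores branch into top-k alternatives (a one-level tree) and keep the highest-scoring candidate.

```python
import math
import random

VOCAB = ["Paris", "London", "is", "the", "capital", "of", "France", "."]
KNOWN_FACTUAL = {"Paris", "is", "the", "capital", "of", "France", "."}  # toy stand-in

def toy_model(prefix):
    """Stand-in next-token distribution (a real LLM forward pass would go here)."""
    random.seed(len(prefix))  # deterministic per position, for reproducibility
    logits = [random.uniform(0.0, 1.0) for _ in VOCAB]
    z = sum(math.exp(x) for x in logits)
    return {tok: math.exp(x) / z for tok, x in zip(VOCAB, logits)}

def monitor(token, probs):
    """Hypothetical monitor function: penalize tokens outside the toy fact set."""
    return probs[token] if token in KNOWN_FACTUAL else 0.1 * probs[token]

def monitoring_decode(steps=6, risk_threshold=0.12, branch_width=3):
    """Greedy decoding with in-process intervention on hallucination-prone tokens."""
    out = []
    for _ in range(steps):
        probs = toy_model(out)
        token = max(probs, key=probs.get)  # greedy proposal
        if monitor(token, probs) < risk_threshold:
            # Tree-based intervention: branch into the top-k alternatives and
            # keep the candidate the monitor scores highest.
            candidates = sorted(probs, key=probs.get, reverse=True)[:branch_width]
            token = max(candidates, key=lambda t: monitor(t, probs))
        out.append(token)
    return out
```

Because only flagged positions trigger the branch-and-rescore step, the extra cost is localized to a few tokens rather than paid for entire resampled sequences, which is the efficiency argument the abstract makes against full-length self-consistency sampling.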