Training-free LLM Verification via Recycling Few-shot Examples

📅 2025-06-08

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) exhibit stochasticity and inconsistency in their reasoning outputs. Existing verification methods, such as majority voting or Best-of-N with external verifiers, are limited either in applicability or by the need for additional training. Method: We propose Referi, a training-free, lightweight verification framework that reuses the same set of in-context few-shot examples for both response generation and verification. Motivated by Bayes' rule, Referi combines two scores to select the candidate that is both confidently determined and contextually coherent, using only a few additional inferences of the same LLM. Contribution/Results: Referi requires no fine-tuning and no parameter updates. Evaluated with three LLMs on seven diverse reasoning tasks, it achieves an average accuracy improvement of 4.8%, significantly outperforming majority voting and Best-of-N baselines.

📝 Abstract

Although LLMs have achieved remarkable performance, the inherent stochasticity of their reasoning process and the varying conclusions it produces present significant challenges. Majority voting and Best-of-N with external verification models have been explored to find the most promising solution among multiple LLM outputs. However, these approaches have limitations, such as limited applicability or the cost of an additional training step. To address this problem, we propose a novel and effective framework that Recycles Few-shot examples to verify LLM outputs (Referi). Our key idea is to use the given few-shot examples not only to generate outputs, as in the conventional few-shot prompting setup, but also to evaluate the candidate outputs for the target query. Specifically, Referi evaluates the generated outputs by combining two different scores, both motivated by Bayes' rule, and then selects the candidate that is both confidently determined and contextually coherent, using only a few additional LLM inferences. Experiments with three different LLMs across seven diverse tasks demonstrate that our framework significantly improves the accuracy of LLMs through effective response selection, achieving an average gain of 4.8% without additional training.
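For context, the majority-voting baseline mentioned in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `majority_vote` and the sample answers are hypothetical, and it assumes each sampled output has already been reduced to a comparable final-answer string.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled outputs.

    Assumption: `answers` holds already-extracted final answers
    (e.g. "42"), not full reasoning traces.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) yields a one-element list of (answer, count) pairs.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled LLM outputs.
sampled = ["42", "41", "42", "40", "42"]
selected = majority_vote(sampled)
```

Voting of this kind only works when answers can be compared for exact equality, which is one form of the limited applicability the abstract refers to.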

Problem

Research questions and friction points this paper is trying to address.

Verifying LLM outputs without training using few-shot examples

Selecting most reliable responses via Bayesian-inspired scoring mechanism

Improving LLM accuracy through recycling demonstration examples

Innovation

Methods, ideas, or system contributions that make the work stand out.

Recycles few-shot examples for LLM output verification

Combines two scores motivated by Bayes' rule

Selects confident candidates through additional LLM inferences
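The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say how the two scores are computed or combined, so the field names `confidence_logprob` and `coherence_logprob`, the additive combination in log space, and the numeric values are all hypothetical placeholders rather than Referi's actual formulation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    # Hypothetical log-score: how confidently the answer is determined.
    confidence_logprob: float
    # Hypothetical log-score: how coherent the answer is with the
    # few-shot examples.
    coherence_logprob: float


def combined_score(candidate):
    """Combine the two scores.

    Assumption: adding log-scores (multiplying probabilities) is one
    plausible combination; the source does not specify the rule.
    """
    return candidate.confidence_logprob + candidate.coherence_logprob


def select_candidate(candidates):
    """Pick the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=combined_score)


# Made-up scores for three candidate outputs to the same query.
candidates = [
    Candidate("answer A", confidence_logprob=-0.2, coherence_logprob=-2.5),
    Candidate("answer B", confidence_logprob=-0.6, coherence_logprob=-0.7),
    Candidate("answer C", confidence_logprob=-1.9, coherence_logprob=-0.4),
]
best = select_candidate(candidates)
```

In this toy data, answer A is the most confident and answer C the most coherent, but answer B wins because it does reasonably well on both scores, which mirrors the stated goal of choosing a candidate that is both confidently determined and contextually coherent.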

🔎 Similar Papers

Claim Verification in the Age of Large Language Models: A Survey