🤖 AI Summary
In multi-agent LLM debates, models frequently exhibit overconfidence, producing hallucinated outputs while failing to account for peer agents' confidence, which exacerbates error propagation. To address this, we propose DebUnc, the first framework to systematically integrate uncertainty quantification into multi-LLM agent debate. DebUnc estimates each agent's confidence with uncertainty metrics and conveys it either through textual confidence prompts or by adjusting the LLM's attention weights, effectively performing uncertainty-weighted aggregation within a multi-round debate architecture compatible with mainstream models including Llama and GPT. Evaluated on mathematical reasoning and commonsense QA benchmarks, DebUnc substantially suppresses hallucinations, achieving an average accuracy gain of 12.3%. Crucially, its uncertainty estimates correlate strongly with task performance, demonstrating both effectiveness and interpretability.
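The attention-based idea can be illustrated with a minimal sketch. Here we assume mean token entropy as the uncertainty metric and a simple per-agent rescaling of attention weights followed by renormalization; the function names and the entropy-to-confidence mapping are illustrative, not the paper's exact implementation.

```python
import numpy as np

def mean_token_entropy(token_probs):
    """Uncertainty as the mean entropy of each token's predictive distribution."""
    eps = 1e-12
    entropies = [-(p * np.log(p + eps)).sum() for p in token_probs]
    return float(np.mean(entropies))

def confidence_weighted_attention(attn, token_conf):
    """Rescale attention weights by the confidence of each token's source agent,
    then renormalize so the weights still sum to 1."""
    w = attn * token_conf
    return w / w.sum()

# Two peer agents: agent A is confident (peaked distributions),
# agent B is uncertain (near-uniform distributions).
probs_a = [np.array([0.97, 0.01, 0.01, 0.01])] * 3
probs_b = [np.array([0.25, 0.25, 0.25, 0.25])] * 3

# Illustrative mapping from uncertainty to a confidence weight in (0, 1].
conf_a = 1.0 / (1.0 + mean_token_entropy(probs_a))
conf_b = 1.0 / (1.0 + mean_token_entropy(probs_b))

# Start from uniform attention over 6 peer tokens (3 from each agent).
attn = np.full(6, 1 / 6)
token_conf = np.array([conf_a] * 3 + [conf_b] * 3)
reweighted = confidence_weighted_attention(attn, token_conf)
```

After reweighting, tokens from the confident agent receive more attention than tokens from the uncertain one, while the weights remain a valid distribution.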
📝 Abstract
To enhance Large Language Model (LLM) capabilities, multi-agent debates have been introduced, where multiple LLMs discuss solutions to a problem over several rounds of debate. However, LLMs often produce incorrect responses that appear deceptively confident, which can mislead other agents. This is partly because agents do not express their confidence levels during standard debates. To address this, we introduce DebUnc, a multi-agent debate framework that uses uncertainty metrics to assess agent confidence levels. We adapted the LLM attention mechanism to adjust token weights based on confidence levels, and also explored using textual prompts to convey confidence. Our evaluations across various benchmarks show that attention-based methods are particularly effective, and that as uncertainty metrics evolve, performance will continue to improve. The code is available at https://github.com/lukeyoffe/debunc.