Auto-Prompt Ensemble for LLM Judge

📅 2025-10-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current LLM evaluation systems suffer from low reliability because they fail to identify the implicit dimensions embedded in human judgments. To address this, we propose the Auto-Prompt Ensemble framework coupled with Collective Confidence, a confidence-aware method: the framework adaptively discovers missing evaluation dimensions by analyzing failure cases, and it integrates prompt engineering, ensemble learning, and test-time computation scaling for dynamic, multi-dimensional assessment. The key contribution is the first integration of automated implicit-dimension discovery with confidence-driven evaluation, bridging the gap between human and machine evaluation criteria. Under zero-shot settings, the method raises GPT-4o's human agreement rate on RewardBench from 87.2% to 90.5%, improving judgment consistency and robustness.

๐Ÿ“ Abstract
We present a novel framework that improves the reliability of LLM judges by selectively augmenting the LLM with auxiliary evaluation dimensions. Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize the implicit standards underlying human assessments. To address this challenge, we propose the Auto-Prompt Ensemble (APE), an adaptive framework that automatically learns evaluation dimensions from its failure cases. APE incorporates a confidence-based ensemble mechanism to decide when to adopt the judgments from additional evaluation dimensions, using a novel confidence estimation approach called Collective Confidence. Extensive experiments demonstrate that APE improves the reliability of LLM judges across diverse standard benchmarks. For instance, APE raises GPT-4o's agreement rate on RewardBench from 87.2% to 90.5% in the zero-shot setting. Overall, APE provides a principled approach for LLM judges to leverage test-time computation and bridge the evaluation gap between human and LLM judges.
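
As a rough illustration of the loop the abstract describes, here is a minimal sketch, assuming a generic `llm(prompt) -> str` client; the function names, prompts, and yes/no verdict format are hypothetical reconstructions for illustration, not the paper's released code.

```python
# Hypothetical sketch of the APE loop (assumed names; not the authors' code).
# `llm` is any callable taking a prompt string and returning the model's text.

def judge(llm, question, answer, dimension=None):
    """Single-response judgment, optionally augmented with one extra dimension."""
    rubric = f"Pay special attention to: {dimension}.\n" if dimension else ""
    prompt = (
        f"{rubric}Question: {question}\nAnswer: {answer}\n"
        "Is this answer acceptable? Reply 'yes' or 'no'."
    )
    return llm(prompt).strip().lower()

def discover_dimensions(llm, failure_cases):
    """Ask the model to name the implicit dimension behind each failure case."""
    dimensions = []
    for case in failure_cases:
        prompt = (
            "The judgment below disagreed with the human label. Name one "
            "implicit evaluation dimension (a short phrase, e.g. 'factual "
            "precision') that would have caught the error.\n"
            f"Question: {case['question']}\nAnswer: {case['answer']}\n"
            f"Judge said: {case['judged']}; human said: {case['label']}"
        )
        dimensions.append(llm(prompt).strip())
    return dimensions

def augmented_verdicts(llm, question, answer, dimensions):
    """One verdict per discovered dimension, to be combined by the ensemble."""
    return [judge(llm, question, answer, dim) for dim in dimensions]
```

In this reading, disagreements between `judge` and human labels on a development set supply `failure_cases`; the discovered dimensions then parameterize the augmented judges whose verdicts feed the confidence-based ensemble sketched further below.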
Problem

Research questions and friction points this paper is trying to address.

LLM judges often miss crucial evaluation dimensions, making their verdicts unreliable
Implicit standards underlying human assessments go unrecognized by the judge
An evaluation gap persists between human and LLM judges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-Prompt Ensemble (APE) automatically learns evaluation dimensions from its failure cases
A confidence-based ensemble decides when to adopt judgments from the additional dimensions (see the sketch below)
Collective Confidence, a novel confidence estimation approach, supplies the signal for that decision
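
The page does not give the Collective Confidence formula, so the version below is a loudly labeled assumption: agreement (vote share) among the dimension-augmented judges acts as the confidence signal, and the ensemble verdict replaces the base verdict only when that signal clears a threshold. Both the vote-share heuristic and the `threshold` value are illustrative.

```python
from collections import Counter

def collective_confidence(votes):
    """Vote share of the majority verdict, used here as the confidence signal."""
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / len(votes)

def ensemble_judge(base_verdict, dimension_verdicts, threshold=0.8):
    """Adopt the dimension ensemble's verdict only when collectively confident."""
    verdict, confidence = collective_confidence(dimension_verdicts)
    return verdict if confidence >= threshold else base_verdict

# Four of five dimension-augmented judges overrule the base judge:
print(ensemble_judge("no", ["yes", "yes", "yes", "yes", "no"]))  # -> "yes"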