🤖 AI Summary
Pointwise large language model (LLM) rankers suffer from limited adherence to standardized comparative guidelines and insufficient capability in holistically evaluating complex passages. To address this, we propose a dynamic multi-perspective evaluation criterion generation method: leveraging prompt engineering to instantiate interpretable, dimension-specific criteria — covering semantics, relevance, structure, and more — in real time, and aggregating scores across these criteria into a single ranking score. This design makes LLM-based evaluation decomposable and interpretable while letting the perspectives reinforce one another. Evaluated on eight diverse datasets from the BEIR benchmark, our approach significantly improves ranking performance, yielding an average 3.2% relative gain in NDCG@10. The results demonstrate that dynamic, multi-perspective guidance effectively enhances the ranking capability of pointwise LLM rankers.
📝 Abstract
The most recent pointwise Large Language Model (LLM) rankers have achieved remarkable ranking results. However, these rankers are hindered by two major drawbacks: (1) they fail to follow standardized comparison guidance during the ranking process, and (2) they struggle to evaluate complicated passages comprehensively. To address these shortcomings, we propose to build a ranker that generates ranking scores based on a set of criteria from various perspectives. These criteria are intended to direct each perspective in providing a distinct yet synergistic evaluation. Our research, which examines eight datasets from the BEIR benchmark, demonstrates that incorporating this multi-perspective criteria ensemble approach markedly enhances the performance of pointwise LLM rankers.
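The multi-perspective idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the perspective names, the 0–4 rating scale, the prompt wording, and the mean aggregation are all assumptions for the sake of the example, and `llm_score` stands in for a call to an actual LLM.

```python
from statistics import mean

# Hypothetical perspectives; the paper's actual criteria are generated
# dynamically per query/passage and may differ.
PERSPECTIVES = ["semantic consistency", "topical relevance", "structural coherence"]

def criterion_prompt(perspective, query, passage):
    # Each perspective yields its own evaluation criterion in the prompt.
    return (f"Rate from 0 to 4 how well the passage satisfies the criterion "
            f"'{perspective}' for the query.\nQuery: {query}\nPassage: {passage}")

def multi_perspective_score(llm_score, query, passage):
    # Pointwise scoring: one LLM call per perspective, aggregated (mean here).
    scores = [llm_score(criterion_prompt(p, query, passage)) for p in PERSPECTIVES]
    return mean(scores)

def rank(llm_score, query, passages):
    # Score each passage independently and sort by aggregate score, descending.
    return sorted(
        passages,
        key=lambda p: multi_perspective_score(llm_score, query, p),
        reverse=True,
    )
```

With a real model, `llm_score` would send the prompt to the ranker LLM and parse a numeric rating from its response; weighted aggregation could replace the plain mean if some perspectives matter more for a given query.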