Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks suffer from data contamination and diminishing discriminative power as models advance, leading to performance saturation and poor fine-grained differentiation. Method: We propose EMDM, a weighted evaluation metric that integrates Chain-of-Thought (CoT) reasoning correctness with final-answer correctness. It assigns sample-level weights reflecting the reasoning depth and complexity needed to solve each sample, estimated with a baseline LLM under two inference settings (Unguided and Guided), and optimizes those weights under a separation objective. Contribution/Results: On ARC-Challenge, the metric achieves a model separation rate of 46%, outperforming traditional exact match (EM) by 29 percentage points (17% → 46%), thereby substantially enhancing fine-grained model discrimination.

📝 Abstract
Existing benchmarks are becoming saturated and struggle to separate model performances due to factors like data contamination and advancing LLM capabilities. This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric that revitalizes benchmarks by enhancing model separation. EMDM integrates final-answer and Chain-of-Thought (CoT) reasoning correctness, assigning weights based on the complexity and reasoning depth required to solve a given sample in the evaluation data. Using a baseline LLM in two setups, Unguided, where the model has no prior exposure to the test samples, and Guided, where the model has prior knowledge of the desired answer, EMDM distinguishes instances of varying difficulty. The CoT and answer correctness from these setups inform an optimization objective for weight assignment, resulting in a more nuanced evaluation of model performance. Compared to the exact match (EM) metric, which achieves 17% separation on ARC-Challenge, EMDM achieves 46%, demonstrating its effectiveness in differentiating models based on reasoning and knowledge requirements.
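
The paper ships no code, so the following is a minimal Python sketch of the weighting idea under stated assumptions: a baseline LLM's answer and CoT correctness in the Guided and Unguided setups buckets each sample by difficulty, each bucket carries a weight, and the evaluated model's EMDM score is its weight-normalized credit over the benchmark. The bucketing rule, the hard-coded WEIGHTS, and the requirement that both answer and CoT be correct are illustrative assumptions; the paper learns its weights through an optimization objective.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    # Baseline-LLM correctness signals used to estimate difficulty
    # (assumption: booleans for answer/CoT under Guided and Unguided).
    guided_answer_ok: bool
    guided_cot_ok: bool
    unguided_answer_ok: bool
    unguided_cot_ok: bool

def bucket(s: Sample) -> str:
    """Map a sample to a difficulty bucket from baseline correctness.

    Illustrative rule: samples the baseline fails even when guided are
    'hard'; samples it solves unguided are 'easy'; the rest are 'medium'.
    """
    if not (s.guided_answer_ok and s.guided_cot_ok):
        return "hard"
    if s.unguided_answer_ok and s.unguided_cot_ok:
        return "easy"
    return "medium"

# Hypothetical weights; the paper optimizes these under a
# model-separation objective rather than fixing them by hand.
WEIGHTS = {"easy": 0.5, "medium": 1.0, "hard": 2.0}

def emdm_score(samples: list[Sample],
               answer_ok: list[bool],
               cot_ok: list[bool]) -> float:
    """Weighted accuracy of an evaluated model: credit requires both a
    correct final answer and a correct CoT, weighted by difficulty."""
    total = sum(WEIGHTS[bucket(s)] for s in samples)
    earned = sum(
        WEIGHTS[bucket(s)]
        for s, a, c in zip(samples, answer_ok, cot_ok)
        if a and c
    )
    return earned / total if total else 0.0
```

Under this scheme a strong model earns most of its score on the "hard" samples, which is exactly where a saturated EM score stops moving.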
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks are saturating: data contamination and rapidly advancing LLM capabilities shrink the score gaps between models.
Standard exact-match scoring treats every sample equally, ignoring the reasoning depth and complexity needed to solve it.
Evaluations therefore fail to differentiate models at a fine-grained level, motivating a difficulty-weighted metric.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces EMDM (Enhanced Model Differentiation Metric) for enhanced model differentiation
Assigns sample-level weights based on complexity and reasoning depth, optimized for model separation
Improves separation on ARC-Challenge from 17% (EM) to 46% (see the sketch below)
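
The paper reports separation jumping from 17% under EM to 46% under EMDM, but this page does not reproduce the metric's definition. A minimal sketch, assuming separation is measured as the mean pairwise absolute score gap between evaluated models; the model names and scores below are hypothetical:

```python
from itertools import combinations

def separation(scores: dict[str, float]) -> float:
    """Mean pairwise absolute score gap across models (an assumed
    definition; the paper's separation metric may differ)."""
    pairs = list(combinations(scores.values(), 2))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Hypothetical scores on a saturated benchmark under EM vs. EMDM:
em_scores = {"model_a": 0.90, "model_b": 0.88, "model_c": 0.85}
emdm_scores = {"model_a": 0.72, "model_b": 0.55, "model_c": 0.38}
print(f"EM separation:   {separation(em_scores):.2f}")
print(f"EMDM separation: {separation(emdm_scores):.2f}")
```

The intended effect is visible even in this toy example: reweighting toward hard samples spreads out scores that EM had compressed into a narrow band.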
👥 Authors
Bryan Etzine
University of Florida
Masoud Hashemi
ServiceNow
LLM · Trust & Governance · Medical signal and image processing · Compressed sensing
Nishanth Madhusudhan
ServiceNow Research
Sagar Davasam
ServiceNow Research
Roshnee Sharma
ServiceNow Research
Sathwik Tejaswi Madhusudhan
ServiceNow Research
Vikas Yadav
ServiceNow, University of Arizona
Natural Language Processing · Deep learning