Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Quantifying the marginal contribution of individual modules (such as planning, reasoning, and execution) to the overall performance of large language model (LLM) agents remains challenging, hindering interpretability and principled modular design. Method: We introduce CapaBench, the first benchmark to apply the Shapley value from cooperative game theory to module-wise capability attribution in LLM agents. It uses module-swapping experiments and multi-round, cross-domain task sampling to yield measurable, comparable, and optimizable module contributions. A standardized evaluation dataset of over one thousand samples is constructed and validated across diverse agent architectures. Contribution/Results: CapaBench produces consistent attributions across architectures, and combinations of high-Shapley-value modules show predictable performance gains. The framework improves the reliability of module-replacement decisions and establishes a paradigm for decompositional analysis and controllable optimization of LLM agents.

📝 Abstract
Large Language Model (LLM) agent frameworks often employ modular architectures, incorporating components such as planning, reasoning, action execution, and reflection to tackle complex tasks. However, quantifying the contribution of each module to overall system performance remains a significant challenge, impeding optimization and interpretability. To address this, we introduce CapaBench (Capability-level Assessment Benchmark), an evaluation framework grounded in cooperative game theory's Shapley Value, which systematically measures the marginal impact of individual modules and their interactions within an agent's architecture. By replacing default modules with test variants across all possible combinations, CapaBench provides a principled method for attributing performance contributions. Key contributions include: (1) We are the first to propose a Shapley Value-based methodology for quantifying the contributions of capabilities in LLM agents; (2) Modules with high Shapley Values consistently lead to predictable performance gains when combined, enabling targeted optimization; and (3) We build a multi-round dataset of over 1,000 entries spanning diverse domains and practical task scenarios, enabling comprehensive evaluation of agent capabilities. CapaBench bridges the gap between component-level evaluation and holistic system assessment, providing actionable insights for optimizing modular LLM agents and advancing their deployment in complex, real-world scenarios.
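The Shapley-style attribution the abstract describes can be sketched as follows. This is a minimal illustration of the standard Shapley value formula over module coalitions, not CapaBench's actual evaluation code; the module names, the toy value function, and all numbers are assumptions for illustration only:

```python
from itertools import combinations
from math import factorial

def shapley_values(perf, modules):
    """Shapley value of each module, where perf(coalition) is the agent's
    measured performance with exactly that set of modules upgraded."""
    n = len(modules)
    phi = {}
    for i in modules:
        others = [m for m in modules if m != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1 among the other modules
            for S in combinations(others, k):
                S = frozenset(S)
                # Classic Shapley weight: |S|! * (n - |S| - 1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                # Marginal contribution of module i to coalition S
                total += w * (perf(S | {i}) - perf(S))
        phi[i] = total
    return phi

def perf(coalition):
    """Toy value function (illustrative only): baseline success rate 0.40,
    each upgraded module adds 0.10, plus a 0.05 synergy bonus when
    planning and reasoning are upgraded together."""
    score = 0.40 + 0.10 * len(coalition)
    if {"planning", "reasoning"} <= coalition:
        score += 0.05
    return score

phi = shapley_values(perf, ("planning", "reasoning", "reflection"))
# The 0.05 synergy splits evenly between planning and reasoning:
# planning ≈ 0.125, reasoning ≈ 0.125, reflection ≈ 0.100
```

Note the efficiency property the paper relies on: the module values sum exactly to the performance gap between the fully upgraded and the default agent, which is what makes the attributions comparable and optimizable. In practice, evaluating `perf` for all 2^n module combinations is the expensive module-swapping step the benchmark performs.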
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Modular Components
Performance Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

CapaBench
Shapley Value
Modular Contribution Quantification