🤖 AI Summary
This work addresses the challenge of quantifying individual document contributions to LLM-generated summaries, a critical issue for fair credit attribution and compensation of content creators. We propose Cluster Shapley, a method that integrates semantic clustering with Shapley value computation: document embeddings are first grouped into semantically coherent clusters, shrinking the player set of the cooperative game, and Shapley values are then computed at the cluster level, from which document-level contributions are derived. Evaluated on Amazon product review summarization, Cluster Shapley achieves a superior efficiency–accuracy trade-off versus Monte Carlo Shapley and Kernel SHAP (a 3.2× speedup and 27% lower error) while remaining agnostic to the specific LLM and summarization pipeline. Its core innovation is incorporating semantic structure into the Shapley attribution framework, enabling scalable, interpretable, and model-agnostic quantification of document-level contributions.
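For reference, the standard Shapley value that underlies this attribution framework, with the characteristic function mapped to the summarization setting as described here (players are source documents and $v(S)$ is the quality of the summary generated from subset $S$, an interpretation inferred from the text):

```latex
% Shapley value of document i within the document set N, where v(S) is
% the quality of the summary generated from the document subset S.
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
    \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}
    \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

Exact computation evaluates $v$ on every subset of $N$, i.e. $2^{|N|}$ summary generations, which is what makes collapsing the documents into a handful of cluster-level players pay off.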
📝 Abstract
Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources, such as search engines and AI assistants. While these models enhance user experience by generating coherent summaries, they obscure the contributions of original content creators, raising concerns about credit attribution and compensation. We address the challenge of valuing individual documents used in LLM-generated summaries. We propose using Shapley values, a game-theoretic method that allocates credit based on each document's marginal contribution. Although theoretically appealing, Shapley values are expensive to compute at scale. We therefore propose Cluster Shapley, an efficient approximation algorithm that leverages semantic similarity between documents. By clustering documents using LLM-based embeddings and computing Shapley values at the cluster level, our method significantly reduces computation while maintaining attribution quality. We demonstrate our approach on a summarization task using Amazon product reviews. Cluster Shapley substantially reduces computational cost while maintaining high accuracy, achieving a better efficiency–accuracy frontier than baseline methods such as Monte Carlo sampling and Kernel SHAP. Our approach is agnostic to the exact LLM, summarization process, and evaluation procedure used, making it broadly applicable to a variety of summarization settings.
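A minimal sketch of the pipeline the abstract describes, assuming an embedding function `embed`, a summary-quality scorer `summary_quality`, agglomerative clustering, and an equal split of each cluster's value among its member documents; none of these specifics are stated in the abstract, so treat this as illustrative rather than the paper's implementation:

```python
# Hypothetical sketch of Cluster Shapley; `embed`, `summary_quality`,
# the clustering choice, and the equal within-cluster split are all
# illustrative assumptions, not the paper's exact implementation.
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_shapley(docs, embed, summary_quality, n_clusters=5):
    """Attribute summary credit to documents by playing the Shapley
    game over semantic clusters instead of individual documents."""
    # Step 1: embed documents and group them into semantic clusters.
    X = np.array([embed(d) for d in docs])
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    clusters = [np.flatnonzero(labels == c) for c in range(n_clusters)]

    # Characteristic function: quality of the summary generated from
    # the documents belonging to the chosen subset of clusters.
    def v(subset):
        members = [docs[i] for c in subset for i in clusters[c]]
        return summary_quality(members) if members else 0.0

    # Step 2: exact Shapley values over the (small) set of clusters.
    phi = np.zeros(n_clusters)
    for c in range(n_clusters):
        others = [p for p in range(n_clusters) if p != c]
        for size in range(n_clusters):
            for S in combinations(others, size):
                w = (factorial(size) * factorial(n_clusters - size - 1)
                     / factorial(n_clusters))
                phi[c] += w * (v(S + (c,)) - v(S))

    # Step 3 (assumption): split each cluster's value equally among
    # the documents it contains.
    doc_values = np.zeros(len(docs))
    for c, members in enumerate(clusters):
        doc_values[members] = phi[c] / len(members)
    return doc_values
```

With k clusters, the exact Shapley loop above needs on the order of 2^k calls to `summary_quality` instead of 2^n for n individual documents, which is the source of the computational savings the abstract claims.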