Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making

📅 2025-02-06
🤖 AI Summary
This paper addresses two challenges in using data values for selection: (1) the lack of a unified theoretical foundation, and (2) the high computational cost of obtaining optimal data values. It casts data selection as a sequential decision-making problem in which the optimal data value is derived through dynamic programming. This formulation unifies and reinterprets existing methods (e.g., Data Shapley) as myopic reward function approximations, and the paper analyzes how the optimality of sequential selection is affected when the ground-truth utility is monotone submodular with curvature. For scalability, the authors use learned bipartite graphs as surrogate utility models, under which greedy selection remains optimal when the surrogate is correctly specified and learned. Experiments across diverse datasets demonstrate the effectiveness of the approach.

📝 Abstract
Data selection has emerged as a crucial downstream application of data valuation. While existing data valuation methods have shown promise in selection tasks, the theoretical foundations and full potential of using data values for selection remain largely unexplored. In this work, we first demonstrate that data values applied for selection can be naturally reformulated as a sequential-decision-making problem, where the optimal data value can be derived through dynamic programming. We show this framework unifies and reinterprets existing methods like Data Shapley through the lens of approximate dynamic programming, specifically as myopic reward function approximations to this sequential problem. Furthermore, we analyze how sequential data selection optimality is affected when the ground-truth utility function exhibits monotonic submodularity with curvature. To address the computational challenges in obtaining optimal data values, we propose an efficient approximation scheme using learned bipartite graphs as surrogate utility models, ensuring greedy selection is still optimal when the surrogate utility is correctly specified and learned. Extensive experiments demonstrate the effectiveness of our approach across diverse datasets.
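
The sequential view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy coverage utility, the `COVERS` data, and the function names are all assumptions made for illustration. The state is the set of points selected so far, an action adds one point, the reward is the marginal utility gain, and exact dynamic programming over a tiny pool gives the optimal value of each choice. A one-step greedy rule is included for comparison.

```python
from functools import lru_cache

# Hypothetical toy pool: each candidate point "covers" some test cases.
# The utility of a selection is the number of distinct cases covered,
# which is a monotone submodular function.
COVERS = {
    0: frozenset({1, 2, 3, 4}),
    1: frozenset({1, 2, 5}),
    2: frozenset({3, 4, 6}),
}


def utility(selected):
    """Toy utility: number of distinct test cases covered by `selected`."""
    covered = set()
    for i in selected:
        covered |= COVERS[i]
    return len(covered)


@lru_cache(maxsize=None)
def value(selected, steps_left):
    """Best total reward obtainable from state `selected` with `steps_left` picks."""
    remaining = [i for i in COVERS if i not in selected]
    if steps_left == 0 or not remaining:
        return 0
    return max(q_value(selected, i, steps_left) for i in remaining)


def q_value(selected, i, steps_left):
    """Immediate marginal gain of adding `i`, plus the optimal value afterwards."""
    nxt = selected | {i}  # frozenset | set stays a frozenset, so it is hashable
    return utility(nxt) - utility(selected) + value(nxt, steps_left - 1)


def dp_select(k):
    """Select k points by following the exact dynamic-programming values."""
    selected, order = frozenset(), []
    for step in range(k):
        remaining = [i for i in COVERS if i not in selected]
        best = max(remaining, key=lambda i: q_value(selected, i, k - step))
        order.append(best)
        selected = selected | {best}
    return order


def greedy_select(k):
    """Select k points by immediate marginal gain only (one-step lookahead)."""
    selected, order = frozenset(), []
    for _ in range(k):
        remaining = [i for i in COVERS if i not in selected]
        best = max(remaining, key=lambda i: utility(selected | {i}) - utility(selected))
        order.append(best)
        selected = selected | {best}
    return order


if __name__ == "__main__":
    print(dp_select(2), utility(dp_select(2)))          # [1, 2] 6
    print(greedy_select(2), utility(greedy_select(2)))  # [0, 1] 5
```

On this toy pool the lookahead values prefer the two complementary points, while the one-step rule first takes the individually largest point and ends with lower total utility. Exact enumeration like this scales exponentially with the pool size, which is the computational cost the abstract's approximation scheme is meant to avoid.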
Problem

Research questions and friction points this paper is trying to address.

Unifying data valuation for selection tasks.
Optimizing data selection via dynamic programming.
Addressing computational challenges with approximation schemes.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates data selection as a sequential decision-making problem.
Derives the optimal data value through dynamic programming.
Proposes an efficient approximation using learned bipartite graphs as surrogate utility models.
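
For contrast with the sequential formulation, the sketch below computes exact Data Shapley values on a tiny pool and selects by ranking them. Data Shapley scores a point by its average marginal contribution over all orderings of the pool; the toy coverage utility, the `COVERS` data, and the helper names are assumptions for illustration and do not come from the paper. The scores are computed once and do not change as points are selected, which is one way to read the abstract's description of such methods as myopic approximations.

```python
from itertools import permutations

# Hypothetical toy pool: each candidate point "covers" some test cases.
COVERS = {
    0: {"a", "b", "c", "f"},
    1: {"a", "b"},
    2: {"d", "e"},
}


def utility(selected):
    """Toy utility: number of distinct test cases covered by `selected`."""
    covered = set()
    for i in selected:
        covered |= COVERS[i]
    return len(covered)


def shapley_values(points):
    """Exact Shapley value: mean marginal gain of each point over all orderings."""
    totals = {i: 0.0 for i in points}
    n_orders = 0
    for order in permutations(points):
        selected = set()
        prev = utility(selected)
        for i in order:
            selected.add(i)
            cur = utility(selected)
            totals[i] += cur - prev  # marginal contribution of i in this ordering
            prev = cur
        n_orders += 1
    return {i: total / n_orders for i, total in totals.items()}


def top_k_by_value(values, k):
    """Static selection: rank points once by score and keep the k highest."""
    return sorted(values, key=values.get, reverse=True)[:k]


if __name__ == "__main__":
    values = shapley_values(list(COVERS))
    print(values)
    print(top_k_by_value(values, 2))
```

Enumerating all orderings is only feasible for a handful of points; practical Data Shapley estimators sample orderings instead.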
Hongliang Chi
Rensselaer Polytechnic Institute
LLM, RL, Data-Centric AI, Optimization, GNN
Qiong Wu
AT&T-Chief Data Office, Bedminster, NJ, United States
Zhengyi Zhou
AT&T-Chief Data Office, Bedminster, NJ, United States
Jonathan Light
RPI PhD
Decision making under uncertainty, foundation models, reinforcement learning
Emily Dodwell
AT&T-Chief Data Office, Bedminster, NJ, United States
Yao Ma
Rensselaer Polytechnic Institute, Troy, NY, United States