TAROT: Targeted Data Selection via Optimal Transport

📅 2024-11-30

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Greedy, influence-based data selection methods have limited modeling capacity and accumulate bias when the target distribution is multimodal, that is, when the target data follows several distinct patterns. To address this, we propose TAROT, a targeted data selection framework based on optimal transport. Our approach couples whitened feature distances with optimal transport to calibrate high-dimensional influence estimation. Minimizing the transport distance between the selected data and the target domain also yields an estimate of the optimal selection ratio, which relaxes the restrictive linear additive assumptions of greedy selection. Key contributions include: (1) a whitened feature distance that mitigates the bias from dominant feature components; (2) data selection by minimizing the optimal transport distance to the target domain, which also estimates the selection ratio; and (3) explicit handling of complex multimodal target distributions. In experiments on semantic segmentation, motion prediction, and instruction tuning, TAROT consistently outperforms state-of-the-art methods on the target domains. The implementation is publicly available.

📝 Abstract

We propose TAROT, a targeted data selection framework grounded in optimal transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, these heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary factors contributing to this limitation: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions inherent in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, providing a more reliable measure of data influence. Building on this, TAROT uses whitened feature distance to quantify and minimize the optimal transport distance between the selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, highlighting its versatility across various deep learning tasks. Code is available at https://github.com/vita-epfl/TAROT.
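To make the idea of a whitened feature distance concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each sample is already represented by a feature vector, that whitening statistics are estimated from the pooled candidate and target features, and that Euclidean distance is used after whitening; the function names `whiten` and `whitened_feature_distance` are hypothetical, and the paper's exact features and metric may differ.

```python
import numpy as np


def whiten(features, reference, eps=1e-6):
    """Whiten `features` with the mean and covariance estimated from `reference`.

    Assumption: ZCA whitening via an eigendecomposition of the covariance.
    """
    mean = reference.mean(axis=0)
    centered = reference - mean
    cov = centered.T @ centered / max(len(reference) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sigma^{-1/2}; eps guards against division by near-zero eigenvalues.
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return (features - mean) @ inv_sqrt


def whitened_feature_distance(candidates, targets, eps=1e-6):
    """Pairwise Euclidean distance between whitened candidate and target features.

    Returns an array of shape (num_candidates, num_targets).
    """
    pooled = np.vstack([candidates, targets])
    zc = whiten(candidates, pooled, eps)
    zt = whiten(targets, pooled, eps)
    diff = zc[:, None, :] - zt[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```

After whitening, every direction of the pooled features has roughly unit variance, so a single high-variance component can no longer dominate the distance.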

Problem

Research questions and friction points this paper is trying to address.

Addresses suboptimal data selection in multimodal distributions

Mitigates dominant feature bias in influence estimation

Minimizes optimal transport distance for targeted data selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses optimal transport theory for data selection

Incorporates whitened feature distance to reduce bias

Minimizes transport distance between selected and target data
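The points above can be illustrated with a small sketch of selection by transport distance. This is not the TAROT algorithm itself: it assumes uniform weights on both sets, approximates the optimal transport cost with log-domain Sinkhorn iterations, ranks candidates by their distance to the nearest target, and picks the prefix of that ranking with the lowest transport cost. The names `sinkhorn_cost` and `select_by_ot`, the ranking heuristic, and the default parameters are all assumptions made for illustration.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_cost(cost, reg=0.1, max_iters=2000, tol=1e-9):
    """Entropic-regularized OT cost between two uniformly weighted point sets.

    cost[i, j] is the distance from source point i to target point j.
    """
    n, m = cost.shape
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(max_iters):
        # Alternate dual updates so that row, then column, marginals match.
        f = reg * (log_a - _logsumexp((g[None, :] - cost) / reg, axis=1))
        g = reg * (log_b - _logsumexp((f[:, None] - cost) / reg, axis=0))
        plan = np.exp((f[:, None] + g[None, :] - cost) / reg)
        if np.abs(plan.sum(axis=1) - 1.0 / n).max() < tol:
            break
    return float((plan * cost).sum())


def select_by_ot(cost, reg=0.1):
    """Pick the candidate prefix whose transport cost to the targets is lowest.

    Assumption: candidates are ranked by distance to their nearest target.
    Returns the selected candidate indices and the implied selection ratio.
    """
    order = np.argsort(cost.min(axis=1))
    costs = [sinkhorn_cost(cost[order[:k]], reg) for k in range(1, len(order) + 1)]
    best_k = int(np.argmin(costs)) + 1
    return order[:best_k], best_k / len(order)
```

Here `cost` can be any candidate-to-target distance matrix. Because the subset size is chosen by the transport cost rather than fixed in advance, the selection ratio falls out of the same minimization, and candidates far from every target mode raise the cost and are left out.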

🔎 Similar Papers

Effective Subset Selection Through The Lens of Neural Network Pruning