DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Optimizing data mixture ratios for unknown downstream tasks remains challenging due to the absence of task-specific supervision during pretraining. Method: This paper proposes DUET, the first algorithm that couples Bayesian optimization with online data selection, dynamically adapting cross-domain training data mixture ratios based on coarse-grained feedback from real downstream evaluations. Contribution/Results: DUET provides theoretical guarantees—namely, convergence and an upper bound on cumulative regret—by integrating online feedback-driven adaptation, cross-domain mixture modeling, and rigorous regret analysis. Empirically, on image classification and large language model evaluation benchmarks, DUET significantly outperforms static mixture baselines, rapidly converging to high-performing data compositions and substantially enhancing cross-task generalization capability.

📝 Abstract
The performance of a machine learning (ML) model depends heavily on the relevance of its training data to the domain of the downstream evaluation task. However, in practice, the data involved in an unseen evaluation task is often not known to us (e.g., conversations between an LLM and a user are end-to-end encrypted). So, it is not obvious what data would be relevant for training/fine-tuning the ML model to maximize its task performance. Instead, one can only deploy the ML model in the unseen evaluation task to gather multiple rounds of coarse feedback on how well the model has performed. This paper presents a novel global-to-local algorithm called DUET that can exploit the feedback loop by interleaving a data selection method with Bayesian optimization. As a result, DUET can efficiently refine the training data mixture from a pool of data domains to maximize the model's performance on the unseen evaluation task and its convergence to the optimal data mixture can be theoretically guaranteed by analyzing its cumulative regret. Empirical evaluation on image and LLM evaluation tasks shows that DUET finds better training data mixtures than conventional baselines.
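The abstract describes DUET's global-to-local loop: globally explore candidate training-data mixtures over a pool of domains, deploy the model, collect coarse performance feedback, and locally refine the mixture around the best one found so far. The sketch below illustrates that loop under strong simplifying assumptions: the evaluation feedback is a hypothetical synthetic black box (`evaluate`), and the paper's Bayesian-optimization surrogate and acquisition step are replaced by simple random-restart local search on the probability simplex. It is not the authors' algorithm, only a minimal illustration of the feedback-driven mixture-refinement idea.

```python
import random

def sample_simplex(k, rng):
    # Uniform sample from the probability simplex via the exponential trick.
    xs = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(xs)
    return [x / s for x in xs]

def evaluate(mixture):
    # Hypothetical coarse feedback from the unseen evaluation task:
    # a black-box score peaking at an (unknown to the optimizer) mixture.
    target = [0.6, 0.3, 0.1]
    return -sum((m - t) ** 2 for m, t in zip(mixture, target))

def duet_sketch(rounds=200, k=3, seed=0):
    """Global-to-local search over k-domain mixture ratios (illustrative only)."""
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(rounds):
        if best_w is None or rng.random() < 0.3:
            w = sample_simplex(k, rng)               # global exploration
        else:
            # Local refinement: perturb the incumbent mixture, renormalize.
            w = [max(1e-6, b + rng.gauss(0.0, 0.05)) for b in best_w]
            s = sum(w)
            w = [x / s for x in w]
        score = evaluate(w)                          # one round of feedback
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

In the paper, the inner selection step is driven by Bayesian optimization with regret guarantees rather than the Gaussian perturbation used here; the shared structure is that only scalar feedback per deployment round is needed, never the evaluation task's data itself.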
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Data Optimization
Unknown Task Prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Data Selection
Task-agnostic Learning
Performance Enhancement
Zhiliang Chen
Department of Computer Science, National University of Singapore, Singapore; Institute for Infocomm Research, A*STAR, Singapore
Gregory Kang Ruey Lau
National University of Singapore
data-centric AI, multimodal large language models, machine learning, deep learning, physics
Chuan-Sheng Foo
Institute for Infocomm Research, A*STAR, Singapore; Centre for Frontier AI Research, A*STAR, Singapore
B. Low
Department of Computer Science, National University of Singapore, Singapore