CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning

📅 2024-03-07
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
In multi-task vision-language instruction tuning, knowledge conflicts across tasks can lead to suboptimal overall performance and severe task imbalance. To address this, we propose the Comprehensive Task Balancing (CoTBal) algorithm, described as the first method to systematically model both inter-task contribution and intra-task difficulty in this domain, enabling knowledge-transfer-aware dynamic task weighting. CoTBal comprises three core components: performance-feedback-driven task difficulty quantification, dynamic weighted task sampling, and multi-task gradient coordination optimization. Experiments on vision-language understanding and generation benchmarks, including OK-VQA, NoCaps, and VQAv2, are reported to show that CoTBal achieves state-of-the-art or leading performance. The method is reported to mitigate task interference, improve convergence stability, and enhance generalization across diverse downstream tasks.

📝 Abstract
Visual instruction tuning is a key training stage of large multimodal models (LMMs). Nevertheless, the common practice of indiscriminately mixing instruction-following data from various tasks may result in suboptimal overall performance due to different instruction formats and knowledge domains across tasks. To mitigate this issue, we propose a novel Comprehensive Task Balancing (CoTBal) algorithm for multi-task visual instruction tuning of LMMs. To our knowledge, this is the first work that explores multi-task optimization in visual instruction tuning. Specifically, we consider two key dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon where learning one task potentially enhances the performance in other tasks, attributable to the overlapping knowledge domains, and (2) Intra-Task Difficulty, which refers to the learning difficulty within a single task. By quantifying these two dimensions with performance-based metrics, task balancing is thus enabled by assigning more weights to tasks that offer substantial contributions to others, receive minimal contributions from others, and also have great intra-task difficulties. Experiments show that our CoTBal leads to superior overall performance in multi-task visual instruction tuning.
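The weighting rule described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual formulas, so everything below is an assumption made for illustration: the `contribution` matrix and `difficulty` scores stand in for the performance-based metrics, the score `given - received + difficulty` is one simple way to combine the three criteria, and the softmax normalization and the names `task_weights` and `weighted_loss` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def task_weights(
    contribution: List[List[float]],
    difficulty: List[float],
    temperature: float = 1.0,
) -> List[float]:
    """Illustrative task weights (NOT the paper's formula).

    contribution[i][j]: assumed score for how much learning task i helps task j.
    difficulty[i]: assumed score for how hard task i is to learn on its own.
    """
    n = len(difficulty)
    scores = []
    for i in range(n):
        # Contribution that task i offers to the other tasks.
        given = sum(contribution[i][j] for j in range(n) if j != i)
        # Contribution that task i receives from the other tasks.
        received = sum(contribution[j][i] for j in range(n) if j != i)
        # Assumed combination: favor high given, low received, high difficulty.
        scores.append(given - received + difficulty[i])
    # Rescale so that the weights average to 1 across tasks.
    return [n * w for w in softmax(scores, temperature)]


def weighted_loss(task_losses: List[float], weights: List[float]) -> float:
    """Mean of per-task losses, each scaled by its task weight."""
    return sum(w * l for w, l in zip(weights, task_losses)) / len(task_losses)


# Toy numbers for three hypothetical tasks.
contribution = [
    [0.0, 0.8, 0.6],
    [0.1, 0.0, 0.2],
    [0.3, 0.4, 0.0],
]
difficulty = [0.5, 0.2, 0.9]
weights = task_weights(contribution, difficulty)
```

In this toy setup the first task helps the others the most and receives the least help, so it gets the largest weight; the second task receives the most help and is the easiest, so it gets the smallest.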
Problem

Research questions and friction points this paper is trying to address.

Addresses suboptimal performance in multi-task visual instruction tuning.
Mitigates imbalanced performance due to latent knowledge conflicts.
Introduces CoTBal algorithm for comprehensive task balancing.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CoTBal for multi-task visual instruction tuning
Balances tasks using inter-task and intra-task metrics
Assigns weights based on contribution and difficulty
Authors
Yanqi Dai
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
Dong Jing
Renmin University of China
Computer Vision, Embodied AI
Nanyi Fei
Renmin University of China
Computer Vision, Multimodal
Zhiwu Lu
Professor, Renmin University of China
Machine Learning, Computer Vision, Large Multimodal Models, Video Generation