🤖 AI Summary
In multi-task visual instruction tuning, knowledge conflicts across tasks lead to suboptimal overall performance and task imbalance. To address this, the authors propose the Comprehensive Task Balancing (CoTBal) algorithm, the first method to systematically explore multi-task optimization in this domain. CoTBal balances tasks along two performance-based dimensions: inter-task contribution, which captures how learning one task enhances performance on others, and intra-task difficulty, which captures the learning difficulty within a single task. Quantifying both dimensions enables dynamic task weighting that assigns greater weight to tasks contributing substantially to others, receiving little contribution from others, and exhibiting high difficulty. Experiments on multi-task visual instruction tuning show that CoTBal yields superior overall performance, mitigating task interference and improving balance across diverse downstream tasks.
📝 Abstract
Visual instruction tuning is a key training stage of large multimodal models (LMMs). Nevertheless, the common practice of indiscriminately mixing instruction-following data from various tasks may result in suboptimal overall performance, owing to differing instruction formats and knowledge domains across tasks. To mitigate this issue, we propose a novel Comprehensive Task Balancing (CoTBal) algorithm for multi-task visual instruction tuning of LMMs. To our knowledge, this is the first work to explore multi-task optimization in visual instruction tuning. Specifically, we consider two key dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon where learning one task potentially enhances performance on other tasks, attributable to overlapping knowledge domains, and (2) Intra-Task Difficulty, the learning difficulty within a single task. By quantifying these two dimensions with performance-based metrics, task balancing is enabled by assigning greater weights to tasks that offer substantial contributions to others, receive minimal contributions from others, and exhibit high intra-task difficulty. Experiments show that CoTBal leads to superior overall performance in multi-task visual instruction tuning.
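The weighting principle described above can be sketched in code. This is a hypothetical illustration, not the paper's exact formulas: it assumes a performance-based contribution matrix `contrib[i, j]` (estimated gain on task `j` from training on task `i`) and a per-task `difficulty` vector, then scores each task by contribution given minus contribution received plus difficulty, and normalizes with a softmax.

```python
import numpy as np

def cotbal_style_weights(contrib: np.ndarray, difficulty: np.ndarray) -> np.ndarray:
    """Hypothetical CoTBal-style task weighting (illustrative only).

    contrib[i, j]: performance-based estimate of how much training on
                   task i improves validation performance on task j.
    difficulty[i]: intra-task difficulty of task i, e.g. derived from
                   validation performance (higher = harder).
    """
    n = contrib.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Contribution each task gives to the others (row-wise, excluding itself).
    given = np.where(off_diag, contrib, 0.0).sum(axis=1)
    # Contribution each task receives from the others (column-wise, excluding itself).
    received = np.where(off_diag, contrib, 0.0).sum(axis=0)
    # More weight for: high contribution to others, low contribution
    # received from others, and high intra-task difficulty.
    score = given - received + difficulty
    score = score - score.max()  # numerical stability for the softmax
    w = np.exp(score)
    return w / w.sum()

# Toy example with three tasks.
contrib = np.array([[1.0, 0.3, 0.1],
                    [0.0, 1.0, 0.2],
                    [0.4, 0.1, 1.0]])
difficulty = np.array([0.5, 0.2, 0.8])
print(cotbal_style_weights(contrib, difficulty))
```

In this toy setup, the third task contributes the most to others, receives the least, and is the hardest, so it gets the largest weight.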