Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning

📅 2024-10-18

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Individual vision foundation models (VFMs) exhibit representation bias and uneven performance across tasks in multi-task vision learning, because each is shaped by a different training paradigm. Method: We propose a lightweight framework that distills knowledge from multiple VFMs through a hybrid architecture: Teacher-Specific Adapter Paths, each guided by one teacher model, are coupled with a shared Teacher-Agnostic Stem, and Mixture-of-Representations Routers dynamically select and combine the biased representations so that the teachers' strengths complement one another. Contribution/Results: Crucially, we are the first to explicitly model inherent VFM biases as exploitable priors rather than nuisances to be eliminated. On the NYUD-v2 multi-task benchmark, our method surpasses state-of-the-art approaches by 10% with significantly fewer parameters. Its modular design allows new VFMs to be plugged in, keeping the framework computationally efficient and flexible across diverse tasks.

📝 Abstract

Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to their inherent representation biases originating from different training paradigms, VFMs exhibit advantages and disadvantages across distinct vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile "Swiss Army Knife" (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by combining lightweight Teacher-Specific Adapter Path modules with a Teacher-Agnostic Stem. Through dynamic selection and combination of representations with Mixture-of-Representations Routers, SAK is capable of synergizing the complementary strengths of multiple VFMs. Extensive experiments show that SAK outperforms prior state-of-the-art methods in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that can readily accommodate more advanced model designs. Project page: https://innovator-zero.github.io/SAK/
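
To make the idea of keeping each teacher's bias separate more concrete, here is a minimal sketch of a per-teacher distillation objective. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the loss, so mean squared error is used as a stand-in, and the names `multi_teacher_distill_loss`, `teacher_a`, and `teacher_b` are hypothetical. The point it shows is that each adapter path's output is compared only with its own teacher's features, so one teacher's bias is never averaged into another's.

```python
from typing import Dict, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_teacher_distill_loss(
    path_outputs: Dict[str, Vector],
    teacher_feats: Dict[str, Vector],
) -> float:
    """Average of per-teacher losses (assumed form, not from the paper).

    Each teacher-specific path is matched only against its own teacher.
    """
    losses = [mse(path_outputs[name], teacher_feats[name]) for name in teacher_feats]
    return sum(losses) / len(losses)


# Toy features for two hypothetical teachers.
teachers = {"teacher_a": [1.0, 0.0], "teacher_b": [0.0, 2.0]}
paths = {"teacher_a": [1.0, 0.0], "teacher_b": [0.0, 0.0]}
loss = multi_teacher_distill_loss(paths, teachers)
```

In this toy case the `teacher_a` path already matches its teacher, so the whole loss comes from the `teacher_b` path.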

Problem

Research questions and friction points this paper is trying to address.

Addressing representation biases in Vision Foundation Models

Enhancing multi-task learning by combining multiple VFMs

Developing a flexible framework for dynamic knowledge distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive knowledge distillation from multiple VFMs

Dynamic representation selection via Mixture-of-Representations Routers

Lightweight Teacher-Specific Adapter Path modules
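
The routing idea in the list above can be sketched as a softmax-weighted mixture of representations. This is a simplified illustration, not the paper's router: the gating form, the `gate_logits` argument (which in a real model would be learned, plausibly per task), and the names `stem`, `teacher_a`, and `mix_representations` are all assumptions made for the example.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_representations(
    reps: Dict[str, List[float]],
    gate_logits: Dict[str, float],
) -> List[float]:
    """Combine representations with softmax gate weights (assumed form)."""
    names = list(reps)
    weights = softmax([gate_logits[name] for name in names])
    dim = len(reps[names[0]])
    mixed = [0.0] * dim
    for name, weight in zip(names, weights):
        if len(reps[name]) != dim:
            raise ValueError("all representations must share one dimension")
        for i, value in enumerate(reps[name]):
            mixed[i] += weight * value
    return mixed


# Equal logits give an even blend of the shared stem and one adapter path.
reps = {"stem": [0.0, 0.0], "teacher_a": [2.0, 4.0]}
mixed = mix_representations(reps, {"stem": 0.0, "teacher_a": 0.0})
```

Raising one logit relative to the others shifts the mixture toward that representation, which is how a router of this kind could favor the teacher whose bias suits a given task.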

🔎 Similar Papers

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks