Understanding Task Transfer in Vision-Language Models

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) excel at general multimodal tasks but underperform task-specific models on fine-grained visual perception tasks such as depth estimation and object counting; moreover, single-task fine-tuning often causes unpredictable cross-task performance degradation. Method: the authors systematically study zero-shot transfer behavior across 13 perception tasks, proposing the Perfection Gap Factor (PGF), a quantitative metric for transfer effects, and constructing a task-transfer graph that reveals positive and negative transfer patterns and latent task relationships. Based on these transfer dynamics, tasks are grouped into personas such as "hubs," "facilitators," and "inhibitees" to guide efficient data selection. Contribution/Results: evaluated on three mainstream open-weight VLMs, PGF robustly captures transfer regularities in multi-task evaluation, providing an interpretable, actionable foundation and practical framework for targeted VLM optimization and data-efficient fine-tuning.

📝 Abstract
Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.
Problem

Research questions and friction points this paper is trying to address.

Studying how finetuning affects zero-shot performance across visual tasks
Quantifying transfer effects between perception tasks using new metrics
Identifying patterns of positive and negative transfer between vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Perfection Gap Factor transfer metric
Constructs task-transfer graph revealing task relationships
Organizes tasks into personas based on transfer behavior
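As a hypothetical illustration of the finetune-then-evaluate workflow described above: the sketch below builds a directed task-transfer graph from zero-shot scores measured before and after single-task finetuning. The exact PGF formula is not reproduced here; `transfer_score`, the task names, scores, and the threshold are all illustrative assumptions, with score changes normalized by the base model's remaining headroom to a perfect score.

```python
# Hypothetical sketch of a task-transfer analysis; NOT the paper's PGF formula.

def transfer_score(base: float, finetuned: float, perfect: float = 100.0) -> float:
    """Change in zero-shot score after finetuning on another task,
    normalized by the headroom (gap to a perfect score) of the base model."""
    gap = perfect - base
    if gap <= 0:
        return 0.0
    return (finetuned - base) / gap

def build_transfer_graph(base, after, threshold=0.05):
    """Directed edges (source -> target) for transfers exceeding the threshold.

    base:  {task: zero-shot score of the base model}
    after: {source_task: {target_task: zero-shot score after finetuning on source}}
    """
    edges = {}
    for src, targets in after.items():
        for tgt, score in targets.items():
            if tgt == src:
                continue
            s = transfer_score(base[tgt], score)
            if abs(s) >= threshold:
                edges[(src, tgt)] = s
    return edges

# Toy example with made-up scores for three perception tasks.
base = {"depth": 40.0, "counting": 55.0, "grounding": 60.0}
after = {
    "depth": {"counting": 58.0, "grounding": 57.0},   # helps counting, hurts grounding
    "counting": {"depth": 41.0, "grounding": 60.5},   # changes below threshold
}
edges = build_transfer_graph(base, after)
for (src, tgt), s in sorted(edges.items()):
    kind = "positive" if s > 0 else "negative"
    print(f"{src} -> {tgt}: {s:+.3f} ({kind})")
```

A task with many outgoing positive edges would act as a "facilitator" in the paper's terminology, while one whose incoming edges are mostly negative would be an "inhibitee"; thresholding keeps the graph to the transfers that are large enough to matter.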
Bhuvan Sachdeva
Microsoft Research India
Karan Uppal
Microsoft Research India
Abhinav Java
Microsoft Research
Vineeth N. Balasubramanian
Microsoft Research India