🤖 AI Summary
This work investigates whether vision-language models (VLMs) can perform cross-task visual in-context learning (VICL), where the visual prompt and the target image belong to distinct low-level vision tasks (e.g., edge detection → semantic segmentation). To address this, we propose T2T-VICL, a collaborative framework comprising: (1) the first benchmark dataset explicitly designed for cross-task VICL; (2) a text-to-text–driven implicit knowledge transfer mechanism; and (3) a perception-score–guided inference strategy that overcomes the conventional restriction of VICL to same-task settings. Our method integrates prompt generation and selection, perception-driven reasoning, and multi-metric joint evaluation. Experiments span 19 cross-task scenarios: T2T-VICL achieves state-of-the-art performance on 9 scenarios and ranks second on the remaining 10, demonstrating substantial improvements in VLM generalization across heterogeneous vision tasks and in zero-shot transfer capability.
📝 Abstract
In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on a few demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate the promise of unified vision-language models (VLMs) for solving downstream tasks. But when the visual prompt and the target images originate from different visual tasks, can VLMs still perform VICL? In this paper, we propose a fully collaborative pipeline, T2T-VICL, to investigate the potential of cross-task VICL in VLMs. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and we construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual-score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, pushing the boundaries of cross-task VICL within VLMs.
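To make the perception-score-guided inference concrete, the sketch below shows one plausible way to rank candidate outputs by blending a model-derived perceptual score with a traditional fidelity metric. The weighting scheme, score ranges, and function names here are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of perception-score-guided selection: candidates are
# ranked by a weighted combination of a perceptual score (assumed to lie in
# [0, 1]) and a traditional metric (here PSNR, normalized to [0, 1]).
# The alpha weight and normalization ceiling are illustrative choices.
import math

def psnr(pred, ref, max_val=1.0):
    """Traditional fidelity metric: peak signal-to-noise ratio."""
    mse = sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)

def joint_score(perception, psnr_val, alpha=0.5, psnr_ceiling=50.0):
    """Blend a perceptual score with PSNR clipped and scaled to [0, 1]."""
    return alpha * perception + (1 - alpha) * min(psnr_val / psnr_ceiling, 1.0)

# Rank two hypothetical candidate outputs against a reference image
# (images flattened to lists of pixel intensities for simplicity).
ref = [0.2, 0.4, 0.6, 0.8]
candidates = {
    "a": ([0.2, 0.4, 0.6, 0.8], 0.30),  # pixel-exact, low perceptual score
    "b": ([0.1, 0.5, 0.5, 0.9], 0.95),  # noisier, high perceptual score
}
best = max(
    candidates,
    key=lambda k: joint_score(candidates[k][1], psnr(candidates[k][0], ref)),
)
```

Under this weighting, candidate "b" wins: its high perceptual score outweighs candidate "a"'s perfect PSNR, which is exactly the trade-off a joint multi-metric criterion is meant to expose.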