Is Visual in-Context Learning for Compositional Medical Tasks within Reach?

📅 2025-07-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited generalization and test-time flexibility of models on multi-step, complex medical vision tasks. The authors explore visual in-context learning as a way for a single model to compose multi-task pipelines at test time without re-training. Methodologically, they design a synthetic compositional task generation engine that bootstraps diverse task sequences from arbitrary segmentation datasets, and they investigate masking-based training objectives and the role of codebooks in modeling inter-task dependencies and the parsing and execution of composite instructions. Experiments show improved cross-task generalization on multimodal medical imaging workflows (e.g., localization → segmentation → classification → report generation) and demonstrate end-to-end execution of user-defined visual pipelines at inference time. The study also identifies current limitations in modeling long-range task dependencies, pointing toward editable, composable medical AI systems.
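The codebook mechanism mentioned in the summary is, at its core, a vector-quantization step: continuous features are snapped to the nearest entry of a discrete codebook. A minimal sketch of that standard operation (shapes, seeds, and the `quantize` helper are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry
    (standard vector quantization; simplified illustration)."""
    # Squared Euclidean distances, shape (n_features, n_codes).
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)          # index of the closest code per feature
    return codebook[idx], idx       # quantized vectors and their indices

rng = np.random.default_rng(1)
codebook = rng.normal(size=(8, 4))   # 8 discrete codes of dimension 4
features = codebook[[2, 5]] + 0.01   # features lying near codes 2 and 5
quantized, idx = quantize(features, codebook)
print(idx)
```

Because the features are tiny perturbations of codes 2 and 5, quantization recovers exactly those indices.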

πŸ“ Abstract
In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights especially for multi-modal medical task sequences but also highlights challenges that need to be addressed.
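The compositional task generation engine described in the abstract bootstraps multi-step task sequences from plain segmentation data. A minimal sketch of that idea, assuming a hypothetical localization → segmentation → classification chain (the task names and the `bootstrap_task_sequence` helper are illustrative, not the paper's API):

```python
import numpy as np

def bootstrap_task_sequence(image, mask, label):
    """Derive a compositional task sequence (localization ->
    segmentation -> classification) from a single segmentation
    example; every target is bootstrapped from the mask alone."""
    ys, xs = np.nonzero(mask)
    bbox = (xs.min(), ys.min(), xs.max(), ys.max())  # localization target
    return [
        ("localize", image, bbox),   # find the object in the image
        ("segment",  bbox,  mask),   # segment inside the located region
        ("classify", mask,  label),  # classify the segmented region
    ]

# Toy example: an 8x8 image with one rectangular foreground object.
image = np.zeros((8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True
seq = bootstrap_task_sequence(image, mask, label="lesion")
print([task for task, _, _ in seq])
```

Each step's output feeds the next step's input, which is exactly the structure an in-context learner for compositional tasks has to parse and execute.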
Problem

Research questions and friction points this paper is trying to address.

Explores visual in-context learning for multi-task adaptation
Trains in-context learners on sequences of tasks rather than individual tasks
Develops synthetic task generation for compositional medical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual in-context learning for multi-task adaptation
Synthetic compositional task generation engine
Masking-based training objectives for complex tasks
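The masking-based training objectives listed above can be illustrated by computing a reconstruction loss only on masked positions of a flattened task-sequence representation. A simplified numpy sketch under that assumption (the masking scheme and loss here are illustrative, not the paper's exact objective):

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_objective(sequence_tokens, predictions, mask_ratio=0.5):
    """MSE computed only on randomly masked positions of the
    task-sequence representation (simplified illustration)."""
    mask = rng.random(sequence_tokens.shape) < mask_ratio
    diff = (predictions - sequence_tokens) ** 2
    # Average the squared error over masked positions only.
    return (diff * mask).sum() / max(mask.sum(), 1)

tokens = rng.normal(size=(4, 16))   # 4 task panels, 16 tokens each
preds = tokens + 0.1                # a dummy "model" output
loss = masked_objective(tokens, preds)
print(round(float(loss), 4))
```

Restricting the loss to masked positions forces the model to infer the hidden panels from the visible context, which is the core of the masked training recipe for compositional task sequences.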