🤖 AI Summary
This work addresses the limited generalization and test-time flexibility of models in multi-step, complex medical vision tasks. We propose Visual Context Learning (VCL), a framework enabling a single model to dynamically compose multi-task pipelines at test time without fine-tuning. Methodologically, we design a synthetic task generation engine that bootstraps diverse task sequences from arbitrary segmentation datasets; we further introduce a masked training objective and a codebook mechanism to explicitly model inter-task dependencies and enhance parsing and execution of composite instructions. Experiments demonstrate that VCL significantly improves cross-task generalization across multimodal medical imaging workflows (e.g., localization → segmentation → classification → report generation) and, for the first time, enables end-to-end execution of user-defined visual pipelines at inference time. The study also identifies VCL's current limitations in modeling long-range task dependencies and establishes a new paradigm for editable, composable medical AI systems.
📄 Abstract
In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks at test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to better train models for solving complex, compositional tasks. Our exploration not only provides important insights, especially for multi-modal medical task sequences, but also highlights challenges that need to be addressed.
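To make the bootstrapping idea concrete, the following is a minimal sketch (not the paper's actual engine or API; all function and field names here are illustrative) of how a single segmentation sample could be expanded into a multi-step task sequence such as localization → segmentation → classification, since a bounding box and a class label are both derivable from the mask:

```python
# Hypothetical sketch of compositional task bootstrapping from a
# segmentation sample. Names are illustrative, not the paper's API.
import numpy as np

def bounding_box(mask):
    """Derive a localization target (x_min, y_min, x_max, y_max) from a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def bootstrap_sequence(image, mask, label):
    """Turn one (image, mask, label) sample into an ordered task sequence,
    where each step's target is derived from the same annotation."""
    return [
        {"task": "localization", "input": image, "target": bounding_box(mask)},
        {"task": "segmentation", "input": image, "target": mask},
        {"task": "classification", "input": mask, "target": label},
    ]

# Toy example: an 8x8 image with a 3x3 foreground region
image = np.zeros((8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True
sequence = bootstrap_sequence(image, mask, label="lesion")
print([step["task"] for step in sequence])
# -> ['localization', 'segmentation', 'classification']
print(sequence[0]["target"])
# -> (3, 2, 5, 4)
```

A real engine would sample diverse task orderings and modalities rather than a fixed pipeline, which is what allows training on arbitrary segmentation datasets to cover many composite instructions.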