🤖 AI Summary
To address the challenge that Vision Transformers (ViTs) often obscure cross-modal complementary information in multi-channel imaging (MCI) data—such as medical and remote sensing imagery—this paper proposes the Isolated-Channel ViT (IC-ViT). Methodologically, IC-ViT introduces a channel-independent patchification mechanism that prevents inter-modal interference, supports large-scale single-channel pretraining followed by multi-channel fine-tuning, and jointly models local patch representations and inter-channel dependencies. It combines channel-isolated patch embedding, a weight-shared encoder, and a progressive multi-channel fusion strategy to accommodate heterogeneous MCI inputs. Evaluated on the JUMP-CP, CHAMMI, and So2Sat-LCZ42 benchmarks, IC-ViT outperforms existing channel-adaptive methods by 4–14 percentage points in classification accuracy. This work advances scalable pretraining paradigms for MCI foundation models.
📝 Abstract
Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchification is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representations. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers a 4–14 percentage point performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data.
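To make the core idea concrete, here is a minimal NumPy sketch contrasting channel-wise patchification (each channel produces its own tokens, so no token mixes modalities) with the standard ViT patchify (each token spans all channels). Function names, patch size, and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patchify_channels(image, patch=4):
    """Channel-wise patchification: each channel is split into its own
    patch tokens, so no token mixes information across channels.
    (Illustrative sketch; names/shapes are assumptions.)"""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    tokens = []
    for ch in range(c):                       # isolate each channel
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                tokens.append(image[ch, i:i+patch, j:j+patch].ravel())
    # -> (c * (h//patch) * (w//patch), patch*patch) tokens
    return np.stack(tokens)

def patchify_standard(image, patch=4):
    """Standard ViT patchify for comparison: one token spans ALL channels."""
    c, h, w = image.shape
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(image[:, i:i+patch, j:j+patch].ravel())
    # -> ((h//patch) * (w//patch), c*patch*patch) tokens
    return np.stack(tokens)

img = np.random.rand(5, 8, 8)                 # e.g. a 5-channel microscopy image
print(patchify_channels(img).shape)           # (20, 16): 5 channels x 4 patches
print(patchify_standard(img).shape)           # (4, 80): channels fused per token
```

Because each token depends on exactly one channel, a model pretrained on single-channel inputs sees tokens of the same form it will encounter when fine-tuned on multi-channel data, which is what makes the single-channel-pretrain, multi-channel-finetune recipe consistent.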