🤖 AI Summary
This study investigates how Transformers establish cross-modal associations from in-context examples in multimodal in-context learning (ICL). By training small-scale Transformers on synthetic classification tasks and combining controlled data generation, RoPE (Rotary Position Embedding) analysis, mechanistic interpretability, and circuit tracing, the work systematically compares unimodal and multimodal ICL mechanisms. It reveals, for the first time, a pronounced modality asymmetry in multimodal ICL: when the model is pretrained on high-diversity data in the primary modality, even very low data complexity in the secondary modality suffices for multimodal ICL to emerge. The study further shows that this phenomenon relies on an induction-style label-copying mechanism, which multimodal training refines and extends across modalities. Additionally, the authors introduce the first controllable benchmark platform dedicated to multimodal ICL and demonstrate that RoPE raises the data complexity threshold required for ICL to emerge.
📝 Abstract
Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increase the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.
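To make the experimental setup concrete, the sketch below shows one way such a synthetic multimodal in-context classification episode could be generated. This is a hypothetical illustration, not the authors' actual data pipeline: the function name `make_episode`, the Gaussian class prototypes, and the noise scale are all assumptions. The number of classes per modality plays the role of the "data diversity" knob, and the query's label is recoverable only by matching it against in-context exemplars, i.e. the induction-style copying regime described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_episode(n_classes_a, n_classes_b, dim, context_len):
    """Sample one multimodal in-context classification episode.

    Hypothetical setup: modality A supplies exemplars as noisy class
    prototypes; modality B supplies the label tokens. `n_classes_a` and
    `n_classes_b` act as per-modality data-diversity knobs.
    """
    protos_a = rng.normal(size=(n_classes_a, dim))  # primary-modality prototypes
    protos_b = rng.normal(size=(n_classes_b, dim))  # secondary-modality label vectors

    # Classes appearing in this episode's context (with repeats).
    classes = rng.choice(n_classes_a, size=context_len)
    # Episode-specific mapping from class -> secondary-modality label,
    # so the pairing must be read off the context, not memorized.
    label_of = rng.choice(n_classes_b, size=n_classes_a)

    exemplars = protos_a[classes] + 0.1 * rng.normal(size=(context_len, dim))
    labels = protos_b[label_of[classes]]

    # The query repeats one in-context class; its label is only
    # recoverable by copying from the matching exemplar (ICL regime).
    q_class = rng.choice(classes)
    query = protos_a[q_class] + 0.1 * rng.normal(size=dim)
    return exemplars, labels, query, label_of[q_class]
```

Under this sketch, the paper's asymmetry result would correspond to training with large `n_classes_a` (high primary-modality diversity) while varying `n_classes_b` and observing how small it can be before multimodal ICL fails to emerge.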