Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the theoretical limitations of single-layer self-attention models in multimodal in-context learning, where such models fail to achieve Bayes-optimal performance. To elucidate this issue, we develop an analytically tractable framework for multimodal in-context learning and propose a multi-layer linearized cross-attention mechanism. By combining latent factor models, gradient flow dynamics, and large-context asymptotic theory, we rigorously demonstrate that single-layer linear self-attention cannot uniformly attain Bayes optimality. In contrast, the proposed multi-layer architecture, given sufficient depth and a large context, achieves Bayes-optimal prediction when optimized via gradient flow. This study provides the first theoretical evidence of the critical role of architectural depth in enabling optimal multimodal in-context learning.
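As an illustration only (the paper's exact specification is not reproduced on this page), a latent factor model for two modalities of the kind described above might take the following form, where the loading matrices $A_1, A_2$, the coefficient vector $\beta$, the Gaussian noise terms, and the latent dimension $k$ are assumptions made for this sketch:

$$
z_i \sim \mathcal{N}(0, I_k), \qquad
x_i^{(1)} = A_1 z_i + \varepsilon_i^{(1)}, \qquad
x_i^{(2)} = A_2 z_i + \varepsilon_i^{(2)}, \qquad
y_i = \beta^{\top} z_i + \eta_i .
$$

Under such a model, the Bayes-optimal in-context prediction for a query pair $(x_q^{(1)}, x_q^{(2)})$ is the posterior mean
$\mathbb{E}\big[\, y_q \mid x_q^{(1)}, x_q^{(2)}, \{(x_i^{(1)}, x_i^{(2)}, y_i)\}_{i=1}^{n} \big]$,
which is the benchmark against which the single-layer self-attention and multi-layer cross-attention architectures are compared.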

📝 Abstract
Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.
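To make the architectural comparison concrete, below is a minimal numerical sketch of a "linearized" (softmax-free) self-attention layer and a stack of linearized cross-attention layers. The function names, shapes, residual connections, and the alternating query/key roles across layers are illustrative assumptions and do not reproduce the paper's exact parameterization or training objective.

```python
import numpy as np

def linear_self_attention(Z, W_q, W_k, W_v):
    """One linearized (softmax-free) self-attention layer.

    Z : (n, d) matrix of context tokens.
    Attention scores are purely linear (Q @ K.T), with a residual connection.
    """
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    return Z + (Q @ K.T) @ V / Z.shape[0]

def linear_cross_attention(Xa, Xb, W_q, W_k, W_v):
    """One linearized cross-attention layer: modality-A tokens query
    modality-B tokens; again no softmax, only linear scores."""
    Q, K, V = Xa @ W_q, Xb @ W_k, Xb @ W_v
    return Xa + (Q @ K.T) @ V / Xb.shape[0]

def multilayer_cross_attention(Xa, Xb, params):
    """Stack of cross-attention layers, alternating which modality queries
    the other; `params` is a list of (W_q, W_k, W_v) weight triples."""
    for layer, (W_q, W_k, W_v) in enumerate(params):
        if layer % 2 == 0:
            Xa = linear_cross_attention(Xa, Xb, W_q, W_k, W_v)
        else:
            Xb = linear_cross_attention(Xb, Xa, W_q, W_k, W_v)
    return Xa, Xb

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, L = 32, 8, 4  # context length, token dimension, depth (all illustrative)
    Xa, Xb = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    params = [tuple(rng.normal(size=(d, d)) / d for _ in range(3)) for _ in range(L)]
    out_a, out_b = multilayer_cross_attention(Xa, Xb, params)
    print(out_a.shape, out_b.shape)  # (32, 8) (32, 8)
```

The sketch is meant only to show the structural difference at stake: cross-attention updates one modality's representation using the other modality's tokens, and stacking such layers supplies the depth that, per the abstract, is needed to reach Bayes optimality in the large-context regime.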
Problem

Research questions and friction points this paper is trying to address.

multi-modal
in-context learning
Bayes-optimal
cross-attention
transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal in-context learning
cross-attention
Bayes optimality
transformer theory
latent factor model
Nicholas Barnfield
Department of Statistics, Harvard University
Subhabrata Sen
Assistant Professor of Statistics, Harvard University
Pragya Sur
Department of Statistics, Harvard University