TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

Multimodal in-context learning (ICL) performance is highly sensitive to the quality of example sequences, and its underlying inference mechanisms in large vision-language models (LVLMs) remain opaque—particularly hindering complex reasoning and open-ended generation tasks. To address this, we propose, for the first time, a “task-mapping” perspective that jointly models example sequence construction and internal model reasoning as a bidirectional co-adaptive process. Methodologically, we design a lightweight Transformer architecture incorporating task-aware attention and task-mapping signal injection into autoregressive decoding, enabling dynamic, task-adaptive sequence configuration. Evaluated across five mainstream LVLMs and nine benchmark datasets, our approach consistently outperforms strong baselines, with particularly pronounced gains on complex reasoning and open-ended generation tasks. This work establishes a new, interpretable, and optimization-friendly paradigm for multimodal ICL.

Technology Category

Application Category

📝 Abstract

Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input in-context sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures in-context sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a valuable perspective for interpreting and improving multimodal ICL.

Problem

Research questions and friction points this paper is trying to address.

Understanding how LVLMs exploit in-context sequences during inference

Improving multimodal ICL effectiveness for complex reasoning tasks

Dynamically configuring in-context sequences via task-aware attention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-aware attention dynamically configures sequences

Task-mapping signals guide autoregressive decoding

Lightweight transformer enhances multimodal ICL

🔎 Similar Papers

From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning