HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks

📅 2026-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the sensitivity of large multimodal models to the configuration of in-context learning (ICL) examples, as well as the high computational cost of ICL itself. The authors propose HiFICL, a high-fidelity ICL framework that models ICL as a dynamic mixture of the standard attention output and the context values, realized through learnable virtual key-value pairs with a low-rank decomposition. Trained end to end, HiFICL functions as a context-aware parameter-efficient fine-tuning (PEFT) strategy. Extensive experiments show that HiFICL consistently outperforms existing ICL approximation methods across multiple multimodal benchmarks, achieving both higher accuracy and greater stability.
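As a sketch of the decomposition the summary alludes to (our notation, assumed rather than taken from the paper): softmax attention over a query concatenated with demonstration tokens splits exactly into a query-dependent mixture of the standard attention output and attention over the context alone,

$$
\mathrm{Attn}\big(q, [K_c; K_x], [V_c; V_x]\big) = \big(1 - \lambda(q)\big)\,\mathrm{Attn}(q, K_x, V_x) + \lambda(q)\,\mathrm{Attn}(q, K_c, V_c),
$$

$$
\lambda(q) = \frac{\sum_i \exp\big(q^\top k_{c,i}/\sqrt{d}\big)}{\sum_i \exp\big(q^\top k_{c,i}/\sqrt{d}\big) + \sum_j \exp\big(q^\top k_{x,j}/\sqrt{d}\big)},
$$

where $K_c, V_c$ come from the in-context demonstrations and $K_x, V_x$ from the query input. HiFICL's virtual key-value pairs replace $K_c, V_c$ with learnable low-rank parameters.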

📝 Abstract
In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), which adapt to new tasks from a few in-context demonstrations (ICDs). However, ICL is sensitive to the demonstration configuration and computationally expensive. Mathematically, the influence of the demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a "shift vector". Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HiFICL) to model the ICL mechanism more faithfully. HiFICL consists of three key components: 1) a set of "virtual key-value pairs" that act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks. The code is available at https://github.com/bbbandari/HiFICL.
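Below is a minimal, hypothetical PyTorch sketch of the mechanism the abstract describes: learnable virtual key-value pairs, parameterized by a low-rank factorization and prepended to a frozen attention layer's keys and values. The class and argument names (VirtualKVAttention, n_virtual, rank) are our assumptions, not the authors' released API, and multi-head structure is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VirtualKVAttention(nn.Module):
    """Single-head sketch of HiFICL-style attention (assumed, not the authors' code).

    Learnable "virtual" key-value pairs stand in for in-context demonstrations.
    They are parameterized by low-rank factors and concatenated with the frozen
    model's keys and values, so the softmax itself computes the dynamic mixture
    between the context values and the standard attention output.
    """

    def __init__(self, d_model: int, n_virtual: int = 16, rank: int = 8):
        super().__init__()
        # Low-rank factors: virtual keys/values = A @ B, shape (n_virtual, d_model).
        # These factors are the only trainable parameters (the LMM stays frozen).
        self.key_a = nn.Parameter(torch.randn(n_virtual, rank) * 0.02)
        self.key_b = nn.Parameter(torch.randn(rank, d_model) * 0.02)
        self.val_a = nn.Parameter(torch.randn(n_virtual, rank) * 0.02)
        self.val_b = nn.Parameter(torch.randn(rank, d_model) * 0.02)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q: (batch, n_q, d_model); k, v: (batch, n_kv, d_model) from the frozen layer.
        batch = q.size(0)
        vk = (self.key_a @ self.key_b).expand(batch, -1, -1)  # (batch, n_virtual, d_model)
        vv = (self.val_a @ self.val_b).expand(batch, -1, -1)
        # Prepend the virtual pairs; the attention mass they absorb plays the role
        # of the dynamic mixture coefficient lambda(q) in the decomposition above.
        k_all = torch.cat([vk, k], dim=1)
        v_all = torch.cat([vv, v], dim=1)
        return F.scaled_dot_product_attention(q, k_all, v_all)
```

Training only the low-rank factors end to end while the base model stays frozen is what makes this a context-aware PEFT method rather than full fine-tuning.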
Problem

Research questions and friction points this paper is trying to address.

In-Context Learning
Multimodal Tasks
Demonstration Sensitivity
Computational Cost
Large Multimodal Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Learning
Multimodal Models
Parameter-Efficient Fine-Tuning
Low-Rank Factorization
Virtual Key-Value Pairs
🔎 Similar Papers
No similar papers found.
👥 Authors
Xiaoyu Li
Hong Kong University of Science and Technology
Deep Learning · Computer Graphics · Computational Photography
Yuhang Liu
The University of Adelaide
Representation Learning · LLMs · Latent Variable Models · Responsible AI
Zheng Luo
PhD student, UCLA
Xuanshuo Kang
University of Electronic Science and Technology of China
Fangqi Lou
University of Electronic Science and Technology of China
Xiaohua Wu
University of Electronic Science and Technology of China
Zihan Xiong
University of Electronic Science and Technology of China