Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges in multimodal in-context learning: an inductive gap, in which models often reach correct answers through spurious correlations and struggle to extract consistent rules from examples; interference from redundant visual tokens; and attention bias. To overcome these limitations, the study introduces what it describes as the first systematic inductive-deductive reasoning framework, which integrates similarity-driven visual token compression, dynamic attention rebalancing, chain-of-thought prompting, and reinforcement learning with verifiable rewards. This unified approach substantially enhances reasoning faithfulness and generalization across diverse tasks. Evaluated on eight benchmarks spanning visual perception, logical reasoning, STEM, and sarcasm detection, the framework consistently improves multiple open-source vision-language models over standard in-context learning baselines.

📝 Abstract

In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.
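
The abstract does not spell out how the compression and rebalancing modules are implemented, so the following is only an illustrative sketch of the two general ideas, not the paper's method. It assumes a greedy cosine-similarity filter for redundant patch embeddings and a simple rescaling that gives every image the same total attention mass. The names `compress_tokens`, `rebalance_attention`, and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compress_tokens(tokens, threshold=0.9):
    """Assumed greedy filter: keep a token only if its similarity to every
    already-kept token is below `threshold`. Returns indices of kept tokens."""
    kept = []
    for i, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def rebalance_attention(weights, image_ids):
    """Assumed rebalancing: rescale weights so each image receives an equal
    share of the total mass, preserving proportions within each image.
    An image whose weights sum to zero keeps zero weight."""
    totals = {}
    for w, img in zip(weights, image_ids):
        totals[img] = totals.get(img, 0.0) + w
    share = 1.0 / len(totals)
    return [
        w * share / totals[img] if totals[img] > 0 else 0.0
        for w, img in zip(weights, image_ids)
    ]


# Toy inputs: the second patch nearly duplicates the first, and the
# first image initially holds 80% of the attention mass.
patches = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept_indices = compress_tokens(patches)
balanced = rebalance_attention([0.6, 0.2, 0.1, 0.1], [0, 0, 1, 1])
```

In this toy case the near-duplicate patch is dropped, and after rebalancing both images carry half of the attention mass. A real system would operate on model-internal embeddings and attention maps, and the paper's rebalancing is described as dynamic, which this static rescaling does not capture.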

Problem

Research questions and friction points this paper is trying to address.

in-context learning

vision-language models

inductive reasoning

visual tokens

attention distribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

inductive-deductive reasoning

visual token compression

dynamic attention rebalancing