Knowledge Transfer from Interaction Learning

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision foundation models (VFMs) adopt an outcome-oriented paradigm that hinders effective transfer of the cross-modal interaction knowledge embedded in vision-language models (VLMs), limiting generalization. To address this, we propose an interaction-aware knowledge distillation framework built on inter-layer persistent interaction queries. These queries explicitly extract cross-modal interaction supervision signals from VLMs via cross-attention and enable dynamic, cognition-aligned knowledge transfer within a unified representation space. Crucially, our approach decouples distillation from output-level matching, bypassing reliance on final predictions, and is the first to incorporate explicit modeling of interaction dynamics into VFM training. Experiments demonstrate significant performance gains on TinyImageNet and COCO; zero-shot cross-domain generalization improves by 2.4% on PACS and 9.3% on VLCS. Moreover, training converges faster with negligible parameter overhead.
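
The summary above describes the mechanism compactly; the sketch below shows one plausible reading of it in PyTorch. The page does not include the paper's code, so all names (`InteractionQueryBlock`, `StudentWithInteractionQueries`), shapes, and hyperparameters here are illustrative assumptions: a shared set of learned queries is carried through every student layer, reads each layer's visual tokens via cross-attention, and yields per-layer interaction maps that can later be supervised.

```python
# Minimal sketch of inter-layer persistent interaction queries.
# Hypothetical module names and shapes; not the paper's actual implementation.
import torch
import torch.nn as nn


class InteractionQueryBlock(nn.Module):
    """One student layer augmented with cross-attention so a shared set of
    interaction queries can read that layer's visual features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, feats: torch.Tensor):
        # queries: (B, Q, D) persistent interaction queries
        # feats:   (B, N, D) visual tokens from the current layer
        attended, attn_map = self.cross_attn(queries, feats, feats)
        # Residual update keeps the queries "persistent" across layers.
        queries = self.norm(queries + attended)
        return queries, attn_map  # attn_map: (B, Q, N) interaction pattern


class StudentWithInteractionQueries(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 16, num_layers: int = 4):
        super().__init__()
        # Learned queries shared by all layers (persistent relational structure).
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(
            InteractionQueryBlock(dim) for _ in range(num_layers)
        )

    def forward(self, layer_feats: list[torch.Tensor]):
        q = self.queries.expand(layer_feats[0].size(0), -1, -1)
        attn_maps = []
        for block, feats in zip(self.blocks, layer_feats):
            q, attn = block(q, feats)
            attn_maps.append(attn)
        return q, attn_maps  # maps are later matched to VLM cross-attention


if __name__ == "__main__":
    feats = [torch.randn(2, 196, 256) for _ in range(4)]  # dummy per-layer tokens
    model = StudentWithInteractionQueries()
    q, maps = model(feats)
    print(q.shape, maps[0].shape)  # (2, 16, 256) (2, 16, 196)
```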

📝 Abstract
Current visual foundation models (VFMs) face a fundamental limitation in transferring knowledge from vision-language models (VLMs): while VLMs excel at modeling cross-modal interactions through unified representation spaces, existing VFMs predominantly adopt result-oriented paradigms that neglect the underlying interaction processes. This representational discrepancy hinders effective knowledge transfer and limits generalization across diverse vision tasks. We propose Learning from Interactions (LFI), a cognitive-inspired framework that addresses this gap by explicitly modeling visual understanding as an interactive process. Our key insight is that capturing the dynamic interaction patterns encoded in pre-trained VLMs enables more faithful and efficient knowledge transfer to VFMs. The approach centers on two technical innovations: Interaction Queries, which maintain persistent relational structures across network layers, and interaction-based supervision, derived from the cross-modal attention mechanisms of VLMs. Comprehensive experiments demonstrate consistent improvements across multiple benchmarks, achieving absolute gains of 3.3 on TinyImageNet classification and 1.6 mAP / 2.4 AP on COCO detection and segmentation, with minimal parameter overhead and faster convergence. The framework particularly excels in cross-domain settings, delivering 2.4% and 9.3% zero-shot improvements on PACS and VLCS. Human evaluations further confirm its cognitive alignment, outperforming result-oriented methods by 2.7 times in semantic consistency metrics.
Problem

Research questions and friction points this paper is trying to address.

Transferring knowledge from vision-language models to visual foundation models
Modeling visual understanding as interactive processes for better generalization
Addressing representational discrepancies in cross-modal interaction learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modeling visual understanding as an interactive process
Using Interaction Queries that persist across network layers
Applying interaction-based supervision derived from VLM cross-attention (see the sketch after this list)
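
As a complement to the query sketch above, here is a minimal, hypothetical form the interaction-based supervision could take: matching the student's per-layer interaction maps to the frozen VLM teacher's cross-modal attention with a KL objective. The function name, the choice of KL, and the assumption that student and teacher maps live on an aligned token grid are illustrative, not the paper's stated loss.

```python
# Hypothetical interaction-based supervision: KL-match student interaction
# maps to a frozen VLM teacher's cross-modal attention maps.
import torch
import torch.nn.functional as F


def interaction_distillation_loss(student_maps, teacher_maps, eps: float = 1e-8):
    """Both arguments are lists of (B, Q, N) attention weights whose rows sum
    to 1, one entry per supervised layer; we assume queries and visual tokens
    are already aligned between student and teacher (e.g., interpolated to a
    common grid)."""
    total = 0.0
    for s, t in zip(student_maps, teacher_maps):
        # F.kl_div expects log-probabilities as input and probabilities as
        # target; detach the teacher so gradients flow only to the student.
        total = total + F.kl_div(
            s.clamp_min(eps).log(), t.detach(), reduction="batchmean"
        )
    return total / len(student_maps)


if __name__ == "__main__":
    s = [torch.softmax(torch.randn(2, 16, 196), dim=-1) for _ in range(4)]
    t = [torch.softmax(torch.randn(2, 16, 196), dim=-1) for _ in range(4)]
    print(interaction_distillation_loss(s, t).item())
```

Because the supervision targets attention maps rather than logits or labels, the objective stays decoupled from the teacher's final predictions, which matches the output-agnostic distillation described in the summary.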
Yilin Gao
Shanghai University
Kangyi Chen
Shanghai University
Zhongxing Peng
Shanghai University
Hengjie Lu
Shanghai University
Shugong Xu
Professor at Xi'an Jiaotong-Liverpool University, IEEE Fellow
Machine Learning · Pattern Recognition · Wireless Systems