Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive input length and high computational cost in Large Vision-Language Models (LVLMs) caused by concatenating visual and textual sequences, this paper proposes a dynamic multi-level visual feature embedding method. Instead of expanding the input sequence, it introduces a lightweight hierarchical fusion module that dynamically selects fine-grained visual features from multiple layers of the vision encoder—according to semantic granularity—projects them into alignment with language representations, and injects them into intermediate feed-forward network (FFN) layers of the language model. This is the first approach to explicitly model hierarchical semantic alignment between vision and language representations, jointly preserving low-level visual details and high-level semantics. Integrated with parameter-efficient fine-tuning (PEFT), the method achieves significant improvements over existing PEFT baselines on benchmarks including ScienceQA and COCO Captions, delivering simultaneous gains in accuracy, training efficiency, and inference efficiency.

📝 Abstract
Large Vision-Language Models (LVLMs) commonly follow a paradigm that projects visual features and then concatenates them with text tokens to form a unified sequence input for Large Language Models (LLMs). However, this paradigm leads to a significant increase in the length of the input sequence, resulting in substantial computational overhead. Existing methods attempt to fuse visual information into the intermediate layers of LLMs, which alleviates the sequence length issue but often neglects the hierarchical semantic representations within the model and the fine-grained visual information available in the shallower visual encoding layers. To address this limitation, we propose DEHVF, an efficient vision-language fine-tuning method based on dynamic embedding and fusion of hierarchical visual features. Its core lies in leveraging the inherent hierarchical representation characteristics of visual encoders and language models. Through a lightweight hierarchical visual fuser, it dynamically selects and fuses hierarchical features corresponding to semantic granularity based on the internal representations of each layer in LLMs. The fused layer-specific visual features are then projected and aligned before being directly embedded into the Feed-Forward Network (FFN) of the corresponding layer in LLMs. This approach not only avoids sequence expansion but also dynamically fuses multi-layer visual information. By fine-tuning only a small number of parameters, DEHVF achieves precise alignment and complementarity of cross-modal information at the same semantic granularity. We conducted experiments across various VL benchmarks, including visual question answering on ScienceQA and image captioning on COCO Captions. The results demonstrate that DEHVF achieves higher accuracy than existing parameter-efficient fine-tuning (PEFT) baselines while maintaining efficient training and inference.
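The paper does not include code here, but the core idea of the hierarchical visual fuser — per-LLM-layer gate scores that softly select among features from multiple vision-encoder layers — can be sketched as a weighted fusion. All names, shapes, and the pooled-vector simplification below are assumptions for illustration, not the authors' implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_hierarchical_features(vision_layers, gate_scores):
    """Toy sketch of dynamic hierarchical fusion (hypothetical simplification).

    vision_layers: list of L feature vectors (plain float lists), e.g. pooled
                   outputs of L vision-encoder layers, shallow to deep.
    gate_scores:   list of L scalar scores, assumed to be produced by a
                   lightweight gate conditioned on the current LLM layer's
                   internal representation.
    Returns one fused feature vector of the same dimension, which would then
    be projected into the language space and injected into that layer's FFN.
    """
    weights = softmax(gate_scores)
    dim = len(vision_layers[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, vision_layers):
        for i in range(dim):
            fused[i] += w * feat[i]
    return fused
```

With equal gate scores the fusion averages all encoder layers; as one score dominates, the fused feature approaches that single layer's output, which is how a per-layer gate can emphasize shallow (fine-grained) or deep (semantic) features.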
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead from long input sequences
Dynamically fuses hierarchical visual features with language models
Achieves precise cross-modal alignment with minimal parameter tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic hierarchical visual feature fusion
Embedding into Feed-Forward Network layers
Lightweight parameter-efficient fine-tuning method
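The "embedding into Feed-Forward Network layers" idea above avoids lengthening the token sequence by injecting the (projected, layer-aligned) visual feature inside the FFN instead. A minimal toy sketch, assuming an additive injection into the FFN's intermediate activation — a common pattern, though the paper's exact mechanism may differ — could look like:

```python
def ffn_with_visual_injection(hidden, visual, up_w, down_w):
    """Hypothetical FFN with a visual feature injected mid-computation.

    hidden: d-dim hidden state of one token at this LLM layer.
    visual: d_ff-dim visual feature, assumed already fused and projected
            to match this layer's intermediate dimension.
    up_w:   d_ff x d up-projection matrix (list of rows).
    down_w: d x d_ff down-projection matrix (list of rows).
    """
    # Up-projection followed by ReLU (stand-in for the real activation).
    inter = [max(0.0, sum(w * h for w, h in zip(row, hidden))) for row in up_w]
    # Inject the layer-aligned visual feature additively (assumed mechanism):
    # no extra tokens are appended, so sequence length is unchanged.
    inter = [a + v for a, v in zip(inter, visual)]
    # Down-projection back to the model dimension.
    return [sum(w * a for w, a in zip(row, inter)) for row in down_w]
```

Because the injection happens per layer inside the FFN, each LLM layer can receive visual features matched to its own semantic granularity while the attention cost stays that of the text-only sequence.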
Xinyu Wei
PolyU & PKU
Computer Vision · Deep Learning
Guoli Yang
Advanced Institute of Big Data, Beijing, China
Jialu Zhou
College of Computer Science and Technology, National University of Defense Technology, China
Mingyue Yang
College of Computer Science and Technology, National University of Defense Technology, China
Leqian Li
College of Computer Science and Technology, National University of Defense Technology, China
Kedi Zhang
College of Computer Science and Technology, National University of Defense Technology, China
Chunping Qiu
Intelligent Game and Decision Lab, Beijing 100091, China