Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning

📅 2025-08-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (e.g., CLIP) often suffer from semantic manifold distortion during few-shot image-classification fine-tuning, degrading both intra-class geometric coherence and inter-class topological separation. Method: We propose Manifold-Preserving and Sculpting Tuning (MPS-Tuning), the first approach to jointly preserve both the macro-topological (inter-class distribution) and micro-geometric (intra-class structure) properties of the semantic manifold during VLM fine-tuning. MPS-Tuning approximates an upper bound of the Gromov–Wasserstein distance via Gram-matrix alignment to enforce feature-geometry consistency before and after fine-tuning; it further integrates vision–language similarity optimization with instance-level consistency constraints to enhance inter-class discriminability and cross-modal alignment. The method operates within a parameter-efficient fine-tuning framework without introducing additional parameters. Results: MPS-Tuning achieves significant performance gains across multiple few-shot benchmarks while stably preserving semantic manifold structure, empirically validating the synergy between geometric regularization and discriminative optimization.

📝 Abstract
Pretrained vision-language models (VLMs), such as CLIP, have shown remarkable potential in few-shot image classification and led to numerous effective transfer learning strategies. These methods leverage the pretrained knowledge of VLMs to enable effective domain adaptation while mitigating overfitting through parameter-efficient tuning or instance-based consistency constraints. However, such regularizations often neglect the geometric structure of data distribution, which may lead to distortion of the overall semantic representation. To overcome this limitation, we propose a novel fine-tuning method, Manifold-Preserving and Sculpting Tuning (MPS-Tuning). Regarding the data distribution in feature space as a semantic manifold, MPS-Tuning explicitly constrains the intrinsic geometry of this manifold while further sculpting it to enhance class separability. Specifically, MPS-Tuning preserves both macroscopic and microscopic topological structures of the original manifold by aligning Gram matrices of features before and after fine-tuning. Theoretically, this constraint is shown to approximate an upper bound of the Gromov-Wasserstein distance. Furthermore, features from the image and text modalities are paired, and pairwise similarities are optimized to enhance the manifold's class discriminability. Extensive experiments demonstrate that MPS-Tuning significantly improves model performance while effectively preserving the structure of the semantic manifold. The code will be released.
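The abstract's core mechanism, aligning Gram matrices of features before and after fine-tuning, can be sketched as a simple regularization loss. This is a minimal illustration, not the paper's released code: the function name, the use of L2-normalized features, and the mean-squared-error alignment objective are assumptions; the paper's actual loss (and its relation to the Gromov–Wasserstein upper bound) is defined in the full text.

```python
import numpy as np

def gram_alignment_loss(feats_tuned: np.ndarray, feats_frozen: np.ndarray) -> float:
    """Hypothetical sketch of MPS-Tuning's manifold-preserving term:
    penalize the discrepancy between the Gram (pairwise-similarity)
    matrices of features from the fine-tuned and frozen encoders."""
    # L2-normalize rows so Gram entries are cosine similarities.
    A = feats_tuned / np.linalg.norm(feats_tuned, axis=1, keepdims=True)
    B = feats_frozen / np.linalg.norm(feats_frozen, axis=1, keepdims=True)
    # Gram matrices capture the intrinsic geometry of each feature set.
    G_tuned = A @ A.T
    G_frozen = B @ B.T
    # Mean squared deviation between the two geometries.
    return float(np.mean((G_tuned - G_frozen) ** 2))
```

In training, this term would be added to the discriminative ("sculpting") objective, so the tuned features stay geometrically faithful to the pretrained manifold while class separability is optimized.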
Problem

Research questions and friction points this paper is trying to address.

Preserve geometric structure of data distribution in VLMs
Enhance class separability in few-shot learning
Align multimodal features to improve manifold discriminability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manifold-Preserving and Sculpting Tuning (MPS-Tuning)
Aligns Gram matrices for topological preservation
Optimizes pairwise similarities for class discriminability
Dexia Chen
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Qianjie Zhu
School of Computer, Electronics and Information, Guangxi University, Nanning, China
Weibing Li
School of Computer Science and Engineering, Sun Yat-sen University
Neural Networks · Robotics · Automatic Control
Yue Yu
Peng Cheng Laboratory, Shenzhen, China
Tong Zhang
Peng Cheng Laboratory, Shenzhen, China
Ruixuan Wang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China