🤖 AI Summary
Vision-language models (e.g., CLIP) often suffer from semantic manifold distortion during few-shot image classification fine-tuning, degrading both intra-class geometric coherence and inter-class topological separation.
Method: We propose Manifold-Preserving and Sculpting Tuning (MPS-Tuning), the first approach to jointly preserve both the macro-topological (inter-class distribution) and micro-geometric (intra-class structure) properties of the semantic manifold during VLM fine-tuning. MPS-Tuning approximates an upper bound of the Gromov–Wasserstein distance via Gram matrix alignment to enforce cross-domain feature-geometry consistency; it further combines vision–language similarity optimization with instance-level consistency constraints to enhance inter-class discriminability and cross-modal alignment. The method operates within a parameter-efficient fine-tuning framework and introduces no additional parameters.
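To make the Gram-matrix alignment idea concrete, here is a minimal sketch of such a preservation loss. The function name, the mean-squared (Frobenius-style) mismatch, and the cosine-similarity Gram construction are assumptions for illustration, not the paper's released code; the paper only states that aligning Gram matrices before and after fine-tuning approximates an upper bound of the Gromov–Wasserstein distance.

```python
import numpy as np

def gram_alignment_loss(feats_ft: np.ndarray, feats_pre: np.ndarray) -> float:
    """Hypothetical sketch: penalize mismatch between the Gram matrices of
    fine-tuned features and frozen pretrained features, so the pairwise
    geometry of the semantic manifold is preserved."""
    # L2-normalize rows so each Gram matrix holds cosine similarities
    f1 = feats_ft / np.linalg.norm(feats_ft, axis=1, keepdims=True)
    f0 = feats_pre / np.linalg.norm(feats_pre, axis=1, keepdims=True)
    g1 = f1 @ f1.T  # pairwise structure after fine-tuning
    g0 = f0 @ f0.T  # pairwise structure of the original manifold
    # Mean squared deviation between the two relational structures
    return float(np.mean((g1 - g0) ** 2))
```

The loss is zero when fine-tuning leaves the relational structure of the batch unchanged, and grows as intra-class or inter-class geometry drifts; in practice it would be added to the task loss with a weighting coefficient.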
Results: MPS-Tuning achieves significant performance gains across multiple few-shot benchmarks while stably preserving semantic manifold structure, empirically validating the synergy between geometric regularization and discriminative optimization.
📝 Abstract
Pretrained vision-language models (VLMs), such as CLIP, have shown remarkable potential in few-shot image classification and have inspired numerous effective transfer learning strategies. These methods leverage the pretrained knowledge of VLMs to enable effective domain adaptation while mitigating overfitting through parameter-efficient tuning or instance-based consistency constraints. However, such regularizations often neglect the geometric structure of the data distribution, which may distort the overall semantic representation. To overcome this limitation, we propose a novel fine-tuning method, Manifold-Preserving and Sculpting Tuning (MPS-Tuning). Treating the data distribution in feature space as a semantic manifold, MPS-Tuning explicitly constrains the intrinsic geometry of this manifold while further sculpting it to enhance class separability. Specifically, MPS-Tuning preserves both macroscopic and microscopic topological structures of the original manifold by aligning Gram matrices of features before and after fine-tuning. Theoretically, this constraint is shown to approximate an upper bound of the Gromov–Wasserstein distance. Furthermore, features from the image and text modalities are paired, and pairwise similarities are optimized to enhance the manifold's class discriminability. Extensive experiments demonstrate that MPS-Tuning significantly improves model performance while effectively preserving the structure of the semantic manifold. The code will be released.
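The "sculpting" step, pairing image and text features and optimizing their pairwise similarities, can be sketched as a CLIP-style cross-entropy over temperature-scaled cosine similarities. The function name, the temperature value, and the exact loss form are assumptions for illustration; the abstract specifies only that paired cross-modal similarities are optimized to sharpen class separability.

```python
import numpy as np

def sculpting_loss(img_feats: np.ndarray, txt_feats: np.ndarray,
                   labels: np.ndarray, temperature: float = 0.07) -> float:
    """Hypothetical sketch: push each image feature toward the text feature
    of its class and away from other classes' text features."""
    # Cosine similarities between normalized image and text features
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (n_images, n_classes)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # Cross-entropy: maximize similarity with the matching class text
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

In the full method this discriminative term would be combined with the geometry-preserving constraint, so the manifold is reshaped for separability without losing its original topological structure.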