🤖 AI Summary
Prompt tuning of large vision-language models (VLMs) can overfit to task-specific objectives, which causes catastrophic forgetting of general semantic knowledge and degrades generalization. To address this, the paper proposes Features Matrix (FM) regularization. The method introduces a plug-and-play, structured features matrix that explicitly models and disentangles high-level general semantic representations from the model's deep layers, preserving generic knowledge without modifying the backbone architecture. Technically, FM regularization combines multi-layer feature extraction, cross-sample semantic alignment, and matrix-based knowledge distillation, and can be embedded into mainstream prompt-learning frameworks. Evaluated on multiple task-agnostic benchmarks, it achieves state-of-the-art performance while remaining plug-and-play compatible across diverse VLM architectures. Empirical results indicate reduced overfitting and improved task-agnostic generalization.
📝 Abstract
Recent developments in prompt learning for large vision-language models have significantly improved performance on target-specific tasks. However, these prompt-optimization methods often struggle to handle target-unspecific, or generalizable, tasks effectively. This may be because overfitting during training causes the model to forget its general knowledge, which strongly benefits target-unspecific tasks. To alleviate this issue, we propose a novel Features Matrix (FM) regularization approach designed to enhance these models on target-unspecific tasks. Our method extracts general knowledge and uses it to form a Features Matrix (FM). Specifically, the FM captures the semantics of diverse inputs at a deep, fine-grained level, preserving essential general knowledge and thereby mitigating the risk of overfitting. Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM is effective at improving performance on target-unspecific tasks, achieving state-of-the-art performance.
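The abstract does not give the exact form of the FM loss, so the following is only a minimal illustrative sketch of the general idea of regularizing prompt tuning toward frozen general knowledge. It assumes (these are assumptions, not details from the paper) that a matrix of cosine similarities between image features and class text features stands in for the Features Matrix, that the frozen model's matrix is the reference, and that a mean-squared penalty weighted by a hypothetical `lam` is added to the usual cross-entropy task loss. All function and parameter names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def features_matrix(image_feats, text_feats):
    """Cosine similarity between every image feature and every class text feature.

    Assumption: this similarity matrix is a stand-in for the paper's FM.
    Shapes: image_feats (N, D), text_feats (C, D) -> result (N, C).
    """
    return l2_normalize(image_feats) @ l2_normalize(text_feats).T

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch, using a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def regularized_loss(image_feats, tuned_text_feats, frozen_text_feats,
                     labels, lam=1.0, scale=100.0):
    """Task loss plus a penalty for drifting away from the frozen matrix.

    `lam` (penalty weight) and `scale` (logit temperature) are hypothetical
    hyperparameters chosen for illustration only.
    """
    tuned = features_matrix(image_feats, tuned_text_feats)
    frozen = features_matrix(image_feats, frozen_text_feats)
    task = cross_entropy(scale * tuned, labels)
    reg = np.mean((tuned - frozen) ** 2)  # zero when the tuned matrix matches the frozen one
    return task + lam * reg, task, reg
```

In this sketch the penalty vanishes when the prompt-tuned text features reproduce the frozen ones and grows as tuning pulls the similarity structure away from the pretrained model, which is one simple way to discourage forgetting of general knowledge.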