AI Summary
This work addresses the limitations of original CLIP embeddings in supervised classification, namely insufficient inter-class separability and dimensional redundancy. It introduces the Fukunaga–Koontz Transform (FKT) into vision-language model adaptation for the first time, constructing a closed-form linear projection in a whitened embedding space that effectively suppresses intra-class variation while enhancing inter-class discrimination. The proposed method achieves an integrated optimization of geometric reconstruction, efficient compression, and discriminative enhancement. On ImageNet-1K, it improves Top-1 accuracy from 75.1% to 79.1%, enables 10–12× embedding compression with negligible accuracy loss, and demonstrates consistent gains on larger-scale benchmarks such as ImageNet-14K and ImageNet-21K.
Abstract
Vision-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga–Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight and efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports compression of up to 10–12× with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.
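The abstract describes the pipeline at a high level: whiten the embedding space, find a closed-form discriminant projection in the Fukunaga–Koontz style, then classify by nearest class prototype in the projected space. The sketch below illustrates one standard way to realize this idea with NumPy on synthetic data; it is not the authors' implementation, and the function names, the small-eigenvalue selection rule, and the toy data are all illustrative assumptions. The key property it relies on is that after whitening the total scatter, the between- and within-class scatters sum to the identity, so the whitened directions with the smallest within-class variance carry the most between-class energy.

```python
import numpy as np

def fkt_projection(X, y, out_dim):
    """Fukunaga-Koontz-style discriminant projection (illustrative sketch).

    Whitens the total scatter S_t, then keeps the whitened directions with
    the smallest within-class variance; since S_b + S_w = I after whitening,
    those directions maximize between-class discrimination.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # Eigendecompose total scatter and build a whitening matrix
    St = Xc.T @ Xc / len(X)
    evals, evecs = np.linalg.eigh(St)
    keep = evals > 1e-8                          # drop null directions
    W = evecs[:, keep] / np.sqrt(evals[keep])    # whitening matrix
    Z = Xc @ W                                   # whitened embeddings
    # Within-class scatter in the whitened space
    Sw = np.zeros((Z.shape[1], Z.shape[1]))
    for c in np.unique(y):
        Zc = Z[y == c] - Z[y == c].mean(axis=0)
        Sw += Zc.T @ Zc / len(X)
    w_evals, w_evecs = np.linalg.eigh(Sw)        # ascending eigenvalues
    # Smallest within-class variance = most discriminative (S_b = I - S_w)
    P = w_evecs[:, :out_dim]
    return mu, W @ P                             # combined linear projection

def nearest_prototype(Xtr, ytr, Xte, mu, proj):
    """Classify test points by the nearest class mean in projected space."""
    Ztr, Zte = (Xtr - mu) @ proj, (Xte - mu) @ proj
    classes = np.unique(ytr)
    protos = np.stack([Ztr[ytr == c].mean(axis=0) for c in classes])
    d = ((Zte[:, None, :] - protos[None]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# Toy demo: 3 well-separated classes in 16-D, compressed to 2-D (8x)
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 16))
X = np.concatenate([c + rng.normal(size=(50, 16)) for c in centers])
y = np.repeat([0, 1, 2], 50)
mu, proj = fkt_projection(X, y, out_dim=2)
acc = (nearest_prototype(X, y, X, mu, proj) == y).mean()
```

On this toy data the 2-D projection preserves essentially all class structure, mirroring the paper's observation that heavy compression need not cost accuracy; in the actual method the inputs would be CLIP image embeddings and the prototypes would be per-class visual means.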