AI Summary
This work addresses the limitations of original CLIP embeddings in supervised classification, namely insufficient inter-class separability and dimensional redundancy. It introduces the Fukunaga–Koontz Transform (FKT) into vision-language model adaptation for the first time, constructing a closed-form linear projection in a whitened embedding space that effectively suppresses intra-class variation while enhancing inter-class discrimination. The proposed method achieves an integrated optimization of geometric reconstruction, efficient compression, and discriminative enhancement. On ImageNet-1K, it improves Top-1 accuracy from 75.1% to 79.1%, enables 10–12× embedding compression with negligible accuracy loss, and demonstrates consistent gains on larger-scale benchmarks such as ImageNet-14K and ImageNet-21K.
Abstract
Vision-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga–Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight and efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports compression of up to 10–12× with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.
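The abstract describes the pipeline at a high level: whiten the embedding space, find a closed-form discriminant projection in the Fukunaga–Koontz style, then classify by nearest class prototype in the projected space. The sketch below illustrates one standard way to realize this idea with NumPy on synthetic data; it is not the authors' implementation, and the function names, the small-eigenvalue selection rule, and the toy data are all illustrative assumptions. The key property it relies on is that after whitening the total scatter, the between- and within-class scatters sum to the identity, so the whitened directions with the smallest within-class variance carry the most between-class energy.

```python
import numpy as np

def fkt_projection(X, y, out_dim):
    """Fukunaga-Koontz-style discriminant projection (illustrative sketch).

    Whitens the total scatter S_t, then keeps the whitened directions with
    the smallest within-class variance; since S_b + S_w = I after whitening,
    those directions maximize between-class discrimination.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # Eigendecompose total scatter and build a whitening matrix
    St = Xc.T @ Xc / len(X)
    evals, evecs = np.linalg.eigh(St)
    keep = evals > 1e-8                          # drop null directions
    W = evecs[:, keep] / np.sqrt(evals[keep])    # whitening matrix
    Z = Xc @ W                                   # whitened embeddings
    # Within-class scatter in the whitened space
    Sw = np.zeros((Z.shape[1], Z.shape[1]))
    for c in np.unique(y):
        Zc = Z[y == c] - Z[y == c].mean(axis=0)
        Sw += Zc.T @ Zc / len(X)
    w_evals, w_evecs = np.linalg.eigh(Sw)        # ascending eigenvalues
    # Smallest within-class variance = most discriminative (S_b = I - S_w)
    P = w_evecs[:, :out_dim]
    return mu, W @ P                             # combined linear projection

def nearest_prototype(Xtr, ytr, Xte, mu, proj):
    """Classify test points by the nearest class mean in projected space."""
    Ztr, Zte = (Xtr - mu) @ proj, (Xte - mu) @ proj
    classes = np.unique(ytr)
    protos = np.stack([Ztr[ytr == c].mean(axis=0) for c in classes])
    d = ((Zte[:, None, :] - protos[None]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# Toy demo: 3 well-separated classes in 16-D, compressed to 2-D (8x)
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 16))
X = np.concatenate([c + rng.normal(size=(50, 16)) for c in centers])
y = np.repeat([0, 1, 2], 50)
mu, proj = fkt_projection(X, y, out_dim=2)
acc = (nearest_prototype(X, y, X, mu, proj) == y).mean()
```

On this toy data the 2-D projection preserves essentially all class structure, mirroring the paper's observation that heavy compression need not cost accuracy; in the actual method the inputs would be CLIP image embeddings and the prototypes would be per-class visual means.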