Koo-Fu CLIP: Closed-Form Adaptation of Vision-Language Models via Fukunaga-Koontz Linear Discriminant Analysis

๐Ÿ“… 2026-02-01
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the limitations of raw CLIP embeddings in supervised classification, namely insufficient inter-class separability and dimensional redundancy. It introduces the Fukunaga–Koontz Transform (FKT) into vision-language model adaptation for the first time, constructing a closed-form linear projection in a whitened embedding space that suppresses intra-class variation while enhancing inter-class discrimination. The method jointly reshapes the embedding geometry, compresses the representation, and improves discrimination. On ImageNet-1K, it raises Top-1 accuracy from 75.1% to 79.1%, enables 10-12x embedding compression with negligible accuracy loss, and shows consistent gains on larger-scale benchmarks such as ImageNet-14K and ImageNet-21K.

๐Ÿ“ Abstract
Vision-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga-Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight, efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports compression of up to 10-12x with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.
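The abstract describes whitening the embedding space and then applying a closed-form discriminant projection, followed by nearest-visual-prototype classification. Below is a minimal NumPy sketch of one Fukunaga-Koontz-style construction under the standard identity S_w + S_b = I after total-scatter whitening; it is not the authors' exact formulation, and `fk_lda`, the synthetic data, and all parameters are illustrative assumptions:

```python
import numpy as np

def fk_lda(X, y, out_dim, eps=1e-6):
    """Sketch of a Fukunaga-Koontz-style discriminant projection.

    After whitening with the total scatter S_t, S_w + S_b = I, so the
    eigenvectors of the within-class scatter S_w with the SMALLEST
    eigenvalues carry the LARGEST between-class energy.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # Total scatter, regularized for numerical stability.
    St = Xc.T @ Xc / len(X) + eps * np.eye(X.shape[1])
    evals, evecs = np.linalg.eigh(St)
    W = evecs / np.sqrt(evals)                 # whitening: W.T @ St @ W = I
    Z = Xc @ W
    # Within-class scatter in the whitened space.
    Sw = np.zeros((Z.shape[1], Z.shape[1]))
    for c in np.unique(y):
        Zc = Z[y == c]
        Zc = Zc - Zc.mean(axis=0)
        Sw += Zc.T @ Zc
    Sw /= len(Z)
    _, Vw = np.linalg.eigh(Sw)                 # eigenvalues ascending
    P = W @ Vw[:, :out_dim]                    # most discriminative directions
    return mu, P

# Nearest-prototype classification in the projected space,
# on synthetic stand-ins for CLIP embeddings.
rng = np.random.default_rng(0)
dim, n_per_class = 20, 100
means = 3.0 * np.eye(3, dim)                   # 3 well-separated class means
X = np.vstack([m + rng.normal(size=(n_per_class, dim)) for m in means])
y = np.repeat(np.arange(3), n_per_class)

mu, P = fk_lda(X, y, out_dim=2)
Z = (X - mu) @ P
protos = np.stack([Z[y == c].mean(axis=0) for c in range(3)])
pred = np.argmin(((Z[:, None, :] - protos[None]) ** 2).sum(-1), axis=1)
acc = (pred == y).mean()
```

Because the projection is obtained from two eigendecompositions, adaptation is closed-form: no gradient steps or hyperparameter schedules are needed, which is consistent with the "lightweight and efficient" claim in the abstract.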
Problem

Research questions and friction points this paper is trying to address.

vision-language models
class separation
dimensionality reduction
supervised classification
embedding optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-form adaptation
Fukunaga-Koontz LDA
Vision-language models
Embedding whitening
Dimensionality reduction
๐Ÿ”Ž Similar Papers
No similar papers found.
Matej Suchanek
Visual Recognition Group, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague
Klara Janouskova
Visual Recognition Group, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague
Ondrej Vasatko
Visual Recognition Group, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague
Jiri Matas
Professor, Czech Technical University
computer vision, image processing, pattern recognition, machine learning