🤖 AI Summary
In the era of foundation models, large-scale image retrieval faces the challenge of learning hash representations that are simultaneously compact and discriminative. To address this, we propose CroVCA, a cross-view code alignment framework that replaces multi-objective optimization and complex pipelines with a single binary cross-entropy loss regularized by coding-rate maximization, thereby unifying binary code alignment and diversity control. We design HashCoder, a lightweight MLP hashing network incorporating batch normalization, and use LoRA-based fine-tuning for efficient encoder adaptation while keeping backbone features frozen. Evaluated on standard benchmarks, CroVCA achieves state-of-the-art performance within only five training epochs: for 16-bit hashing, it requires less than two minutes for unsupervised hashing on COCO and approximately three minutes for supervised hashing on ImageNet-100, significantly improving both training efficiency and retrieval accuracy.
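The summary above describes HashCoder as a lightweight MLP head with batch normalization that hashes frozen backbone features. A minimal sketch of such a head is below; the layer widths, activation, and embedding dimension are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class HashCoder(nn.Module):
    """Sketch of a lightweight MLP hashing head (dimensions assumed).

    Maps frozen backbone embeddings to n_bits real-valued logits; the
    final BatchNorm1d zero-centers each bit across the batch, which
    encourages balanced binary codes.
    """
    def __init__(self, in_dim=768, hidden_dim=512, n_bits=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, n_bits),
            nn.BatchNorm1d(n_bits),  # balances bit usage across the batch
        )

    def forward(self, x):
        # Real-valued logits; sign() binarizes them at retrieval time.
        return self.net(x)

# Usage as a probing head on frozen embeddings (random stand-ins here):
coder = HashCoder(in_dim=768, n_bits=16)
z = coder(torch.randn(8, 768))   # logits, shape (8, 16)
codes = torch.sign(z)            # binary codes in {-1, +1}
```

When adapting the encoder itself, the same head would sit on top of a LoRA-fine-tuned backbone, with only the low-rank adapters and the head receiving gradients.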
📝 Abstract
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it is especially fast: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
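The abstract pairs a single binary cross-entropy alignment loss with coding-rate maximization as an anti-collapse regularizer. A hedged sketch of what such an objective could look like is below; the exact CroVCA formulation is not given here, so the stop-gradient targets, the tanh relaxation, the coding-rate form (borrowed from the MCR² literature), and the weight `lam` are all assumptions:

```python
import torch
import torch.nn.functional as F

def crovca_style_loss(z1, z2, eps=0.5, lam=1.0):
    """Illustrative cross-view code alignment objective (assumed form).

    z1, z2: (n, d) hash logits for two semantically aligned views.
    BCE pulls each view's bit probabilities toward the other view's
    detached hard codes; a coding-rate term over the relaxed codes is
    maximized (hence subtracted) to keep the codes diverse.
    """
    # Hard {0, 1} targets from the opposite view, with stop-gradient.
    t1 = (torch.sign(z2).detach() + 1) / 2
    t2 = (torch.sign(z1).detach() + 1) / 2
    bce = (F.binary_cross_entropy_with_logits(z1, t1) +
           F.binary_cross_entropy_with_logits(z2, t2))

    def coding_rate(z):
        # (1/2) * logdet(I + d/(n * eps^2) * Z^T Z) over relaxed codes.
        z = torch.tanh(z)
        n, d = z.shape
        cov = z.T @ z * (d / (n * eps ** 2))
        return 0.5 * torch.logdet(torch.eye(d) + cov)

    rate = coding_rate(z1) + coding_rate(z2)
    return bce - lam * rate

# Usage with two noisy views of the same logits:
z1 = torch.randn(32, 16)
z2 = z1 + 0.1 * torch.randn(32, 16)
loss = crovca_style_loss(z1, z2)
```

Subtracting the coding-rate term means gradient descent on the total loss simultaneously aligns the two views' codes and spreads the codes out, which is the anti-collapse role the abstract assigns to coding-rate maximization.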